Abstract. The vast number of interleavings that a concurrent program
can have is typically identified as the root cause of the difficulty of
automatic analysis of concurrent software. Weak memory is generally believed
to make this problem even harder. We address both issues by modelling
programs' executions with partial orders rather than the interleaving
semantics (SC). We implemented a software analysis tool based on these
ideas. It scales to programs of sufficient size to achieve first-time formal
verification of non-trivial concurrent systems code over a wide range of
models, including SC, Intel x86 and IBM Power.



1 Introduction 

Automatic analysis of concurrent programs is a practical challenge. Hardly any 
of the very few existing tools for concurrency will verify a thousand lines of 
code [21]. Most papers name the number of thread interleavings that a
concurrent program can have as a reason for the difficulty. This view presupposes an
execution model, namely Sequential Consistency (SC) [47], where an execution
is a total order (more precisely an interleaving) of the instructions from different 
threads. The choice of SC as the execution model poses at least two problems. 

First, the large number of interleavings modelling the executions of a program 
makes their enumeration intractable. Context bounded methods [59,54,45,23]
(which are unsound in general) and partial order reduction [56,31,26] can reduce
the number of interleavings to consider, but still suffer from limited
scalability. Second, modern multiprocessors (e.g., Intel x86 or IBM Power) serve as a
reminder that SC is an inappropriate model. Indeed, the weak memory models 
implemented by these chips allow more behaviours than SC. 

We address these two issues by using partial orders to model executions, 
following [58,64,10,57]. We also aim at practical verification of concurrent
programs [17,19,23]. Rarely have these two communities met. Notable exceptions
are [61,62], forming with [T3] the closest related work. We show that the explicit
use of partial orders generalises these works to concurrency at large, from SC to 
weak memory, without affecting efficiency. 

Our method is as follows: we map a program to a formula consisting of two 
parts. The first conjunct describes the data and control flow for each thread of
the program; the second conjunct describes the concurrent executions of these
threads as partial orders. We prove that for any satisfying assignment of this 
formula there is a valid execution w.r.t. our models; and conversely, any valid 
execution gives rise to a satisfying assignment of the formula. 

Thus, given an analysis for sequential programs (the per-thread conjunct), 
we obtain an analysis for concurrent programs. For programs with bounded 
loops, we obtain a sound and complete model checking method. Otherwise, if 
the program has unbounded loops, we obtain an exhaustive analysis up to a 
given bound on loop unrollings, i.e., a bounded model checking method. 

To experiment with our approach, we implement a symbolic decision
procedure answering reachability queries over concurrent C programs w.r.t. a given
memory model. We support a wide range of models, including SC, Intel x86 and
IBM Power. To exercise our tool w.r.t. weak memory, we verify 4500 tests used
to validate formal models against IBM Power chips [60,50]. Our tool is the first
to handle the subtle store atomicity relaxation [4] specific to Power and ARM.

We show that mutual exclusion is not violated in a queue mechanism of the 
Apache HTTP server software. We confirm a bug in the worker synchronisation 
mechanism in PostgreSQL, and that adding two fences fixes the problem. We 
verify that the Read-Copy-Update mechanism of the Linux kernel preserves data
consistency of the object it is protecting. For all examples we perform the analysis 
for a wide range of memory models, from SC to IBM Power via Intel x86. 

We provide the sources of our tool, our experimental logs and our benchmarks 
at http://www.cprover.org/wpo. 



2 Related Work 



We start with models of concurrency, then review tools proving the absence of 
bugs in concurrent software, organised by techniques. 

Models of concurrency. Formal methods traditionally build on Lamport's SC [47].
A year earlier, Lamport defined happens-before models [46] . The happens-before 
order is the smallest partial order containing the program order and the relation 
between a write, and a read from this write. 

These models seem well suited for analyses relative to synchronisation, e.g., [22,25,40],
because the relations they define are oblivious to the implementation of the
idioms. Despite happens-before being a partial order, most of [46] explains how
to linearise it. Hence, this line of work often relies on a notion of total orders. 
Partial orders, however, have been successfully applied in verification in the
context of Petri nets [53], which have been linked to software verification in [41] for
programs with a small state space. 

We (and [15,29,61,62]) reuse the clocks of [46] to build our orders. Yet we
do not aim at linearisation or a transitive closure, as this leads to a polynomial 
overhead of redundant constraints. 

Our work goes beyond the definition and simulation of memory models [32,37,63,60,50].
Implementing an executable version of the memory models is an important step, 
but we go further by studying the validity of systems code in C (as opposed to
assembly or toy languages) w.r.t. both a given memory model and a property. 

The style of the model influences the verification process. Memory models 
roughly fall into two classes: operational and axiomatic. The operational style 
models executions via interleavings, with transitions accessing buffers or queues, 
in addition to the memory (as on SC). Thus this approach inherits the limitations 
of interleaving-based verification. For example, [5] (restricted to Sun Total Store 
Order, TSO) bounds the number of context switches. 

Other methods use operational specifications of TSO, Sun Partial Store
Order (PSO) and Relaxed Memory Order (RMO) to place fences in a
program [44,43,49]. Abdulla et al. [3] address this problem on an operational TSO,
for finite state transition systems instead of programs. The methods of [34,43]
have, in the words of [45], "severely limited scalability". The dynamic technique
presented in [45] scales to 771 lines but does not aim to be sound: the tool picks 
an invalid execution, repairs it, then iterates. 

Axiomatic specifications categorise behaviours by constraining relations on 
memory accesses. Several hardware vendors adopt this style |H2j of specification; 
we build on the axiomatic framework of [8] (cf. Sec. 3). CheckFence [14] also uses
axiomatic specifications, but does not handle the store atomicity relaxation of 
Power and ARM. 

Running example. Below we use Prog. 1 (from the TACAS Software Verification
Competition [11]) as an illustration. The shared variables x and y can reach the
(2N)-th Fibonacci number, depending on the interleaving of thr1 and thr2. Prog. 1
permits at least $O(2^{6N})$ interleavings of thr1 and thr2. In each loop
iteration, thr1 reads x and then y, and then writes x; thr2 reads y and x, and
then writes y. Each interleaving of these two writes yields a unique sequence of
shared memory states. Swapping, e.g., the read of y in thr2 with the write of x in
thr1 does not affect the memory states, but swapping the accesses to the same
address does.

Interleaving tools. Traditionally, tools are based on interleavings, and do not
consider weak memory. By contrast, we handle weak memory by reasoning in 
terms of partial orders. 

Explicit-state model checking performs a search over states of a transition
system. SPIN [36], VeriSoft [30] and Java PathFinder [35,38] implement this
approach; they adopt various forms of partial order reduction (POR) to cope 
with the number of interleavings. 

    #define N 5

    int x=1, y=1;               /* shared variables */

    void thr1() {
      for(int k=0; k<N; ++k)
        x=x+y; }                /* reads x, then y, then writes x */

    void thr2() {
      for(int k=0; k<N; ++k)
        y=y+x; }                /* reads y, then x, then writes y */

    int main() {
      start_thread(thr1);
      start_thread(thr2);
      assert(x<=144 && y<=144);
      return 0; }

    Prog. 1. Fibonacci from [11]

POR reduces soundly the number of interleavings to study [56,31,26] by
observing that a partial order gives rise to a class of interleavings [51], then
picking only one interleaving in each class. Prog. 1 is an instance where the effect
of POR is limited. We noted above that amongst the $O(2^{6N})$ interleavings
permitted by Prog. 1, only the interleavings of the writes give rise to unique
sequences of states. Hence distinct interleavings of the threads representing the
same interleavings of the writes are candidates for reduction. POR reduces the
number of interleavings by at least $2^{2N}$, but $O(2^{4N})$ interleavings remain.

Explicit-state methods may fail to cope with large state spaces, even in a
sequential setting. Symbolic encodings [13] can help, but the state space
often needs further reduction using, e.g., bounded model checking (BMC) [12] or
predicate abstraction [33]. These techniques may also be combined with
POR. ESBMC [19] implements BMC. An instance of Prog. 1 has a fixed N, i.e.,
bounded loops. Thus BMC with N as bound is sound and complete for such an
instance. ESBMC verifies Prog. 1 for N = 10 within 30 mins (cf. Sec. H Fig-HJ).
SatAbs [17] uses predicate abstraction in a CEGAR loop; it completes no more
than N = 3 in 30 mins as it needs multiple predicates per interleaving, resulting
in many refinement iterations. Our approach easily scales to, e.g., N = 50, in less
than 20 s, and more than N = 300 within 30 mins, as we build only a polynomial
number of constraints, at worst cubic in the number of accesses to a given shared
memory address.

Non-interleaving tools. Another line of tools is not based on interleavings. The
existing approaches do not handle weak memory and are either incomplete (i.e.,
fail to prove the absence of a bug) or unsound (i.e., might miss a bug due to the 
assumptions they make). 

Thread-modular reasoning [39,24,28,27,34] is sound, but usually incomplete.
Each read presumes guarantees about the values provided by the environment.
Empty guarantees amount to fully non-deterministic values, thus this is a
trivially sound approach. Our translation of Sec. 4 corresponds to empty guarantees.
The constraints of Sec. 5, however, make our encoding complete.

In Prog. 1, if we guarantee x<=144 && y<=144, the problem becomes trivial,
but finding this guarantee automatically is challenging. Threader [34] fails for
N=1 (cf. Sec. E Fig. I}.

Context bounded methods fix an arbitrary bound on context switches [59,54,45,23].
This supposes that most bugs happen with few context switches. Our method 
does not make this restriction. Moreover, we believe that there is no obvious 
generalisation of these works to weak memory, other than instrumentation as [5] 
does for TSO, i.e., adding information to a program so that its SC executions 
simulate its weak ones. We used our tool in SC mode, and applied the
instrumentation of [5] to it. On average, the instrumentation is 9 times more costly
(cf. Seel Fig.©. 

In Prog. 1, we need at least N context switches to disprove the assertion
assert(x<=143 && y<=143) (or any upper bound to x and y that is the (2N)-th
Fibonacci number minus 1). The hypothesis of the approach (i.e., small context
bounds suffice to find a bug) does not apply here; Poirot fails for N > 1 (cf. Sec. 6
Fig. HI).
[Figure: store buffering litmus test and its event graph, including the read (b) Ry0. Allowed? r1=0; r2=0]

Fig. 1. Store Buffering (sb)
Our work relates the most to [14,29,61,62]; we discuss [29] below and
detail [14,61,62] in Sec. 5.7. These works use axiomatic specifications of SC to
compose the distinct threads. CheckFence [14] models SC with total orders and
transitive closure constraints; [61,62] use partial orders like us. [61,62] note
redundancies of their constraints, but do not explain them; our semantic
foundations (Sec. 3) allow us both to explain their redundancies and avoid them (cf.
Sec.

The encodings of [14,61,62] are $O(N^3)$ for N shared memory accesses to
any address; [55] is quadratic, but in the number of threads times the number
of per-thread transitions, which may include arbitrarily many local accesses. Our
encoding is $O(M^3)$, with M the maximal number of events for a single address. By
contrast, the encodings of [29,61,62] quantify over all addresses. Prog. 1 has two
addresses only, but the difference is already significant: with N the loop bound
of Prog. 1, $(6N)^3$ for [14,29,61,62] vs. $2 \times (3N)^3$ in our case, i.e. 1/4 of
the constraints (cf. Sec. [U Fig. 7 for other case studies).
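
The 1/4 ratio can be checked by a short calculation. The following is a sketch under the assumption, read off Prog. 1, that each loop iteration contributes three accesses to x and three to y (two reads and one write per address), so that each address sees 3N events and the program 6N events overall:

```latex
\[
  (6N)^3 = 216\,N^3,
  \qquad
  2 \times (3N)^3 = 54\,N^3,
  \qquad
  \frac{54\,N^3}{216\,N^3} = \frac{1}{4}.
\]
% For the instance N = 5 of Prog. 1: 30^3 = 27000 versus 2 * 15^3 = 6750.
```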

3 Context: Axiomatic Memory Model 

We use the framework of [8], which provably embraces several architectures:
SC [47], Sun TSO (i.e. the x86 model [55]), PSO and RMO, Alpha, and a
fragment of Power. We present this framework via litmus tests, as shown in Fig. 1.

The keyword allowed asks if the architecture permits the outcome "r1=0; r2=0".
This relates to the event graphs of this program, composed of
relations over read and write memory events. A store instruction (e.g. x ← 1 on
P0) corresponds to a write event ((a) Wx1), and a load (e.g. r1 ← y on P0) to a
read ((b) Ry0). The validity of an execution boils down to the absence of certain
cycles in the event graph. Indeed, an architecture allows an execution when it 
represents a consensus amongst the processors. A cycle in an event graph is a 
potential violation of this consensus. 

If a graph has a cycle, we check if the architecture relaxes some relations. The
consensus ignores relaxed relations, hence may become acyclic, in which case the
architecture allows the final state. In Fig. 1, on SC where nothing is relaxed, the
cycle forbids the execution. x86 relaxes the program order (po in Fig. 1) between
writes and reads, thus the forbidding cycle no longer exists, since (a, b) and (c, d)
are relaxed.

Executions. Formally, an event is a read or a write memory access, composed of
a unique identifier, a direction R for read or W for write, a memory address,
and a value. We represent each instruction by the events it issues. In Fig. 2, we
associate the store x ← 1 on processor P2 to the event (e) Wx1.

    P0            P1            P2           P3
    (a) r1 ← x    (c) r3 ← y    (e) x ← 1    (f) y ← 1
    (b) r2 ← y    (d) r4 ← x

    Allowed? r1=1; r2=0; r3=1; r4=0;

    [Event graph: the events (a) to (f) with their po, rf and fr edges]

Fig. 2. Independent Reads of Independent Writes (iriw)

We associate the program with an event structure E = (E, po), composed of
its events E and the program order po, a per-processor total order. We write dp
for the relation (included in po, the source being a read) modelling dependencies 
between instructions, e.g. an address dependency occurs when computing the 
address of a load or store from the value of a preceding load. 

Then, we represent the communication between processors leading to the final
state via an execution witness X = (ws, rf), which consists of two relations over
the events. First, the write serialisation ws is a per-address total order on writes
which models the memory coherence widely assumed by modern architectures.
It links a write w to any write w' to the same address that hits the memory
after w. Second, the read-from relation rf links a write w to a read r such that
r reads the value written by w.

We include the writes in the consensus via the write serialisation.
Unfortunately, the read-from map does not give us enough information to embed the
reads as well. To that aim, we derive the from-read relation fr from ws and rf. A
read r is in fr with a write w when the write w' from which r reads hit the memory
before w did. Formally, we have:
$(r, w) \in \mathsf{fr} \triangleq \exists w'.\ (w', r) \in \mathsf{rf} \wedge (w', w) \in \mathsf{ws}$.
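
The derivation of fr from rf and ws can be sketched in a few lines of C. This is an illustration only, not the tool's implementation: the adjacency-matrix type `relation`, the bound `MAX_EVENTS`, the function names and the event identifiers are all hypothetical. The demo uses the accesses to x in iriw for the outcome r1=1; r4=0, and assumes an initial write `EV_I0` of 0 to x.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_EVENTS 8 /* hypothetical bound, chosen for this sketch only */

/* A relation over event identifiers 0..MAX_EVENTS-1:
   edge[a][b] is true iff the pair (a, b) belongs to the relation. */
typedef struct { bool edge[MAX_EVENTS][MAX_EVENTS]; } relation;

/* (r, w) is in fr iff some write w2 has (w2, r) in rf and (w2, w) in ws. */
void derive_fr(size_t n, const relation *rf, const relation *ws, relation *fr)
{
    for (size_t r = 0; r < n; ++r)
        for (size_t w = 0; w < n; ++w) {
            fr->edge[r][w] = false;
            for (size_t w2 = 0; w2 < n; ++w2)
                if (rf->edge[w2][r] && ws->edge[w2][w])
                    fr->edge[r][w] = true;
        }
}

/* Accesses to x in iriw; EV_I0 is the assumed initial write of 0 to x. */
enum { EV_I0, EV_E, EV_A, EV_D, N_EVENTS };

/* Builds rf and ws for r1=1; r4=0 and reports whether (d, e) is in fr. */
bool iriw_d_before_e(void)
{
    relation rf = {{{false}}}, ws = {{{false}}}, fr = {{{false}}};
    rf.edge[EV_E][EV_A] = true;   /* r1=1: a reads from e */
    rf.edge[EV_I0][EV_D] = true;  /* r4=0: d reads from the initial write */
    ws.edge[EV_I0][EV_E] = true;  /* the initial write hits memory before e */
    derive_fr(N_EVENTS, &rf, &ws, &fr);
    return fr.edge[EV_D][EV_E];
}
```

Calling `iriw_d_before_e()` should report that d is in fr with e, whereas the read a, which reads from the last write to x, has no fr successor.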

In Fig. 2, the outcome corresponds to the execution on the right if each
memory location and register initially holds 0. If r1=1 in the end, the read (a)
read its value from the write (e) on P2, hence $(e, a) \in \mathsf{rf}$. If r2=0, the read (b) read
its value from the initial state, thus before the write (f) on P3, hence $(b, f) \in \mathsf{fr}$.
Similarly, we have $(f, c) \in \mathsf{rf}$ from r3=1, and $(d, e) \in \mathsf{fr}$ from r4=0.

Relaxed or safe. A processor can commit a write w first to a store buffer, then to
a cache, and finally to memory. When a write hits the memory, all the processors
agree on its value. But when the write w transits in store buffers and caches, 
a processor can read its value through a read r before the value is actually 
available to all processors from the memory. In this case, the read-from relation 
between the write w and the read r does not contribute to the consensus, since 
the reading occurs in advance. 

We model this by some subrelation of the read-from rf being relaxed, i.e. 
not included in the consensus. When a processor can read from its own store 
buffer [3] (the typical TSO/x86 scenario), we relax the internal read-from rfi. 
When two processors P0 and P1 can communicate privately via a cache (a case
of write atomicity relaxation [4]), we relax the external read-from rfe, and call
the corresponding write non-atomic. This is the main particularity of Power or
ARM, and cannot happen on TSO/x86. 

Some program-order pairs are relaxed (e.g. write-read pairs on x86), i.e. only 
a subset of po is guaranteed to occur in this order. 

When a relation is not relaxed, we call it safe. Architectures provide special 
fence (or barrier) instructions, to prevent weak behaviours. Following [8], the
relation $\mathsf{fence} \subseteq \mathsf{po}$ induced by a fence is non-cumulative when it orders certain
pairs of events surrounding the fence, i.e. fence is safe. The relation fence is
cumulative when it makes writes atomic, e.g. by flushing caches. The relation 
fence is A-cumulative (resp. B-cumulative) if rfe; fence (resp. fence; rfe) is safe. 
When stores are atomic (i.e. rfe is safe), e.g. on TSO, we do not need cumulativity. 

Architectures. An architecture A determines the set $\mathsf{safe}_A$ of the relations safe on
A, i.e. the relations embedded in the consensus. Following [8], we consider the
write serialisation ws and the from-read relation fr to be always safe. SC relaxes
nothing, i.e. rf and po are safe. TSO authorises the reordering of write-read pairs
and store buffering (i.e. $\mathsf{po}_{WR}$ and rfi are relaxed) but nothing else. We denote
the safe subset of read-from, i.e. the read-from relation globally agreed on by all 
processors, by grf. 

Finally, an execution (E, X) is valid on A when the three following conditions
hold.
1. SC holds per address, i.e. the communication and the program order for
   accesses with same address po-loc are compatible:
   $\mathsf{uniproc}(E, X) \triangleq \mathsf{acyclic}(\mathsf{ws} \cup \mathsf{rf} \cup \mathsf{fr} \cup \mathsf{po\text{-}loc})$.
2. Values do not come out of thin air, i.e. there is no causal loop:
   $\mathsf{thin}(E, X) \triangleq \mathsf{acyclic}(\mathsf{rf} \cup \mathsf{dp})$.
3. There is a consensus, i.e. the safe relations do not form a cycle:
   $\mathsf{consensus}(E, X) \triangleq \mathsf{acyclic}((\mathsf{ws} \cup \mathsf{rf} \cup \mathsf{fr} \cup \mathsf{po}) \cap \mathsf{safe}_A)$.
Formally: $\mathsf{valid}_A(E, X) \triangleq \mathsf{uniproc}(E, X) \wedge \mathsf{thin}(E, X) \wedge \mathsf{consensus}(E, X)$.
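
As an illustration of the consensus condition, the sketch below checks the store buffering test of Fig. 1 for the outcome r1=0; r2=0, once with all of po safe (as on SC) and once with the write-read pairs of po relaxed (as on TSO/x86). It is a minimal sketch, not the tool's code: the type `relation`, the functions `acyclic` and `sb_consensus`, and the labelling of the events on P1 as (c) Wy1 and (d) Rx0 are assumptions made for the example. Only the consensus condition is checked; uniproc and thin are not.

```c
#include <stdbool.h>
#include <stddef.h>

#define SB_EVENTS 4

/* Events of sb: P0 issues (a) Wx1 then (b) Ry0; P1 is assumed to issue
   (c) Wy1 then (d) Rx0. */
enum { EV_A, EV_B, EV_C, EV_D };

typedef struct { bool edge[SB_EVENTS][SB_EVENTS]; } relation;

/* Depth-first search: colour 1 marks events on the current path,
   colour 2 marks events that are fully explored. */
static bool on_cycle(const relation *r, size_t v, int colour[])
{
    colour[v] = 1;
    for (size_t w = 0; w < SB_EVENTS; ++w) {
        if (!r->edge[v][w]) continue;
        if (colour[w] == 1) return true;
        if (colour[w] == 0 && on_cycle(r, w, colour)) return true;
    }
    colour[v] = 2;
    return false;
}

bool acyclic(const relation *r)
{
    int colour[SB_EVENTS] = {0};
    for (size_t v = 0; v < SB_EVENTS; ++v)
        if (colour[v] == 0 && on_cycle(r, v, colour)) return false;
    return true;
}

/* Consensus for the outcome r1=0; r2=0. With relax_po_wr set, the
   write-read pairs of po are dropped from the safe relations. */
bool sb_consensus(bool relax_po_wr)
{
    relation safe = {{{false}}};
    safe.edge[EV_B][EV_C] = true;     /* fr: b reads the initial y, before c */
    safe.edge[EV_D][EV_A] = true;     /* fr: d reads the initial x, before a */
    if (!relax_po_wr) {
        safe.edge[EV_A][EV_B] = true; /* po on P0 (write-read pair) */
        safe.edge[EV_C][EV_D] = true; /* po on P1 (write-read pair) */
    }
    return acyclic(&safe);
}
```

Here `sb_consensus(false)` finds the cycle a, b, c, d, a and so rejects the outcome, while `sb_consensus(true)` accepts it, matching the discussion of Fig. 1.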

From the validity of executions we deduce a comparison of architectures: We
say that an architecture $A_2$ is stronger than another one $A_1$ when the executions
valid on $A_2$ are valid on $A_1$. Equivalently we would say that $A_1$ is weaker than
$A_2$. Thus, SC is stronger than any other architecture discussed above.

4 Symbolic event structures 

For an architecture A and one execution witness X, the framework of Sec. 3
determines if X is valid on A. To prove reachability of a program state, we 
need to reason about all its executions. To do so efficiently, we use symbolic 
representations capturing all possible executions in a single constraint system. 
We then apply SAT or SMT solvers to decide if a valid execution exists for A, 
and, if so, get a satisfying assignment corresponding to an execution witness. 

As said in Sec. 1, we build two conjuncts. The first one, ssa, represents the
data and control flow per thread. The second, pord, captures the
communications between threads (cf. Sec. 5). We include a reachability property in ssa; the
program has a valid execution violating the property iff $\mathsf{ssa} \wedge \mathsf{pord}$ is satisfiable.

We mostly use static single assignment form (SSA) of the input program to
build ssa (cf. [42] for details). In this SSA variant, each equation is augmented
with a guard: the guard is the disjunction over all conjunctions of branching
guards on paths to the assignment. To deal with concurrency, we use a fresh
index for each occurrence of a given shared memory variable, resulting in a fresh
symbol in the formula. CheckFence [14] and [61,62] use a similarly modified
encoding.

Together with ssa, we build a symbolic event structure (ses). As detailed 
below, it captures basic program information needed to build the second conjunct 
pord in Sec. 5. Fig. 3 illustrates this section: the formula ssa on top corresponds
to the ses beneath. 



    main          P0              P1              P2            P3

    x_0 = 0
    ∧ y_0 = 0     ∧ r1_1 = x_1    ∧ r3_2 = y_2    ∧ x_3 = 1     ∧ y_3 = 1
                  ∧ r2_1 = y_1    ∧ r4_2 = x_2
    ∧ prop

    (i0) Wx x_0
    (i1) Wy y_0
                  (a) Rx x_1      (c) Ry y_2      (e) Wx x_3    (f) Wy y_3
                  (b) Ry y_1      (d) Rx x_2

Fig. 3. The formula ssa for iriw (Fig. 2) with
prop = (r1_1 = 1 ∧ r2_1 = 0 ∧ r3_2 = 1 ∧ r4_2 = 0), and its ses (guards omitted since all true)



Static single assignment form (SSA). To encode ssa we use a variant of SSA [20]
and loop unrolling. The details of this encoding are in [42], except for differences
in the handling of shared memory variables, as explained below.

In SSA, each occurrence of a program variable is annotated with an index. 
We turn assignments in SSA form into equalities, with distinct indexes yielding 
distinct symbols in the resulting equation. For example, the assignment x:=x+1
results in the equality $x_1 = x_0 + 1$. We use unique indexes for assignments
in loops via loop unrolling: repeating x:=x+1 twice yields $x_1 = x_0 + 1$ and
$x_2 = x_1 + 1$. Control flow join points yield additional equations involving the
guards of branches merging at this point (see [42] for details).

In concurrent programs, we also need to consider join points due to
communication between threads, i.e., concurrent SSA form (CSSA) [48]. To deal
with weaker models, we use a fresh index for each occurrence of a given shared
memory variable, resulting in a fresh symbol in the formula. Thus, each
occurrence may take non-deterministic values, i.e. this approach over-approximates
the behaviours of a program. If x is shared in the above example, the modified
SSA encoding of the second loop unrolling becomes $x_3 = x_2 + 1$, breaking any
causality between the first loop iteration (encoded as $x_1 = x_0 + 1$) and the second
one. Sinha and Wang [61,62] use the same approach, but since they consider SC
only, their use of fresh indexes may produce more symbols than necessary.
By adding the negation of the reachability property to be checked to our
(over-approximating) SSA equations, we obtain a formula ssa that is satisfiable
if there exists a (concurrent) counterexample violating the property. As this is an
over-approximation, the converse need not be true, i.e., a satisfying assignment
of ssa may constitute a spurious counterexample. Sec. 5 restores precision using
the pord constraints derived from the ses.

In ssa, memory addresses map to unique symbols via the (symbolic) pointer
dereferencing of [42, Sec. 4]. In the weak memory case, we ensure this by using
analyses sound for this setting [BJ. 

The top of Fig. 3 gives ssa for Fig. 2. We print a column per thread, vertically
following the control flow, but it forms a single conjunction. Each occurrence of 
a program variable carries its SSA index as a subscript. Each occurrence of the 
shared memory variables x and y has a unique SSA index. Here we omit the 
guards, as this program does not use branching or loops. 

From SSA to symbolic event structures. A symbolic event structure (ses)
$\gamma = (\mathbb{S}, \mathsf{po})$ is a set $\mathbb{S}$ of symbolic events and a symbolic program order po. A symbolic
event holds a symbolic value instead of a concrete one as in Sec. 3. We define
g(e) to be the Boolean guard of a symbolic event e, which corresponds to the
guard of the SSA equation as introduced above. We use these guards to build
the executions of Sec. 3: a guard evaluates to true if the branch is taken, false
otherwise. The symbolic program order $\mathsf{po}(\gamma)$ gives a list of symbolic events per
thread of the program. The order of two events in $\mathsf{po}(\gamma)$ gives the program order
in a concrete execution if both guards are true.

Note that $\mathsf{po}(\gamma)$ is an implementation-dependent linearisation of the
branching structure of a thread, induced by the path merging applied while constructing
the SSA form. For instance, if $e_1$ then $e_2$ else $e_3$ could be linearised as
either $(e_1, e_2, e_3)$ or $(e_1, e_3, e_2)$, as any two events of a concrete execution ($e_1$ and
$e_2$, or $e_1$ and $e_3$) remain in program order. The original branching structure, i.e.,
the unlinearised symbolic program order induced by the control flow graph, is
maintained in the relation $\mathsf{po\text{-}br}(\gamma)$. For the above example, $\mathsf{po\text{-}br}(\gamma)$ contains
$(e_1, e_2)$ and $(e_1, e_3)$.

We build the ses $\gamma$ alongside the SSA form, as follows. Each occurrence of
a shared program variable on the right-hand side of an assignment becomes a 
symbolic read, with the SSA-indexed variable as symbolic value, and the guard 
is taken from the SSA equation. Similarly, each occurrence of a shared program 
variable on the left-hand side becomes a symbolic write. Fences do not affect 
memory states in a sequential setting, hence do not appear in SSA equations. 
We simply add a fence event to the ses when we see a fence. We take the order 
of assignments per thread as program order, and mark thread spawn points. 

At the bottom of Fig. 3, we give the ses of iriw. Each column represents the
symbolic program order, per thread. We use the same notation as for the events
of Sec. 3, but values are SSA symbols. Guards are omitted again, as they all are
trivially true. We depict the thread spawn events by starting the program order 
in the appropriate row. Note that we choose to put the two initialisation writes 
in program order on the main thread. 
From symbolic to concrete event structures. To relate to the models of Sec. 3, we
concretise symbolic events. A satisfying assignment to $\mathsf{ssa} \wedge \mathsf{pord}$, as computed
by a SAT or SMT solver, induces, for each symbolic event, a concrete value
(if it is a read or a write) and a valuation of its guard (for both accesses and
fences). A valuation V of the symbols of ssa includes the values of each symbolic
event. Since guards are formulas that are part of ssa, V allows us to evaluate the
guards as well. For a valuation V, we write $\mathsf{conc}(e_s, V)$ for the concrete event
corresponding to $e_s$, if there is one, i.e., if $g(e_s)$ evaluates to true under V.

The concretisation of a set $\mathbb{S}$ of symbolic events is a set E of concrete events,
as in Sec. 3, s.t. for each $e \in E$ there is a symbolic version $e_s$ in $\mathbb{S}$. We write
$\mathsf{conc}(\mathbb{S}, V)$ for this concrete set E. The concretisation $\mathsf{conc}(r_s, V)$ of a symbolic
relation $r_s$ is the relation
$\{(x, y) \mid \exists (x_s, y_s) \in r_s.\ x = \mathsf{conc}(x_s, V) \wedge y = \mathsf{conc}(y_s, V)\}$.

Given an ses $\gamma$, $\mathsf{conc}(\gamma, V)$ is the event structure (cf. Sec. 3), whose set of
events is the concretisation of the events of $\gamma$ w.r.t. V, and whose program order
is the concretisation of $\mathsf{po}(\gamma)$ w.r.t. V. For example, the graph of Fig. 2 (erasing
the rf and fr relations) is a concretisation of the ses of iriw (cf. Fig. 3).

5 Encoding the communication and weak memory 
relations symbolically 

For an architecture A and an ses $\gamma$, we need to represent the communications
(i.e., rf, ws and fr) and the weak memory relations (i.e., $\mathsf{ppo}_A$, $\mathsf{grf}_A$ and $\mathsf{ab}_A$)
of Sec. 3. We encode them as a formula pord, s.t. $\mathsf{ssa} \wedge \mathsf{pord}$ is satisfiable iff
there is an execution valid on A violating the property encoded in ssa. We avoid
transitive closures to obtain a small number of constraints. We start with an
informal overview of our approach, then describe how we encode partial orders,
and finally detail the encoding for each relation of Sec. 3.

Overview. We present our approach on iriw (Fig. 2) and its ses $\gamma$ (Fig. 3). In
Fig. 2, we represent only one possible execution, namely the one corresponding
to the (non-SC) final state of the test at the top of the figure. In this section, we
generate constraints representing all the executions of iriw on a given
architecture. We give these constraints, for the address x, in Fig. 4 in the SC case (for
brevity we skip y, analogous to x). Weakening the architecture removes some
constraints: for Power, we omit the (rf-grf) and (ppo) constraints. For TSO, all
constraints are the same as for SC.

In Fig. 4, each symbol $c_{ab}$ is a clock constraint, representing an ordering
between the events a and b. A variable $s_{wr}$ represents a read-from between the
write w and the read r.

The constraints of Fig. 4 represent the preserved program order (cf. Sec. 5.4),
e.g., on SC or TSO the read-read pairs (a, b) on P0 (ppo P0) and (c, d) on P1
(ppo P1), but nothing on Power. We generate constraints for the read-from (cf.
Sec. 5.1), for example (rf-some x); the first conjunct $s_{i_0 a} \vee s_{ea}$ concerns the read
a on P0. This means that a can read either from the initial write $i_0$ or from
the write e on P2. The selected read-from pair also implies equalities of the
values written and read (rf-val x): for instance, $s_{i_0 a}$ implies that $x_1$ equals the
initialisation $x_0$. The architecture-independent constraints for write serialisation
(cf. Sec. 5.2) and from-read (cf. Sec. 5.3) are specified as (ws x) and (fr x); (ws y)
and (fr y) are analogous. As there are no fences in iriw, we do not generate any
memory fence constraints (cf. Sec. 5.5).

(rf-val x)    $(s_{i_0 a} \Rightarrow x_1 = x_0) \wedge (s_{i_0 d} \Rightarrow x_2 = x_0) \wedge (s_{ea} \Rightarrow x_1 = x_3) \wedge (s_{ed} \Rightarrow x_2 = x_3)$
(rf-grf x)    $(s_{i_0 a} \Rightarrow c_{i_0 a}) \wedge (s_{ea} \Rightarrow c_{ea}) \wedge (s_{i_0 d} \Rightarrow c_{i_0 d}) \wedge (s_{ed} \Rightarrow c_{ed})$
(rf-some x)   $(s_{i_0 a} \vee s_{ea}) \wedge (s_{i_0 d} \vee s_{ed})$
(ws x)
(fr x)        $((s_{i_0 a} \wedge c_{i_0 e}) \Rightarrow c_{ae}) \wedge ((s_{i_0 d} \wedge c_{i_0 e}) \Rightarrow c_{de}) \wedge ((s_{ea} \wedge c_{e i_0}) \Rightarrow c_{a i_0}) \wedge ((s_{ed} \wedge c_{e i_0}) \Rightarrow c_{d i_0})$
(ppo main)    $c_{i_0 i_1}$
(ppo P0)      $c_{ab}$
(ppo P1)      $c_{cd}$

Fig. 4. Partial order constraints for address x in Fig. 2 on SC

We represent the execution of Fig. 2 as follows. For (e, a) and $(i_0, d) \in \mathsf{grf}$,
we have the constraints $s_{ea} \Rightarrow c_{ea}$ and $s_{i_0 d} \Rightarrow c_{i_0 d}$ in (rf-grf x). This means that
a reads from e (as witnessed by $s_{ea}$), and that we record that e is ordered before
a in grf (as witnessed by $c_{ea}$); idem for d and $i_0$. To represent $(d, e) \in \mathsf{fr}$, we pick
the appropriate constraint in (fr x), namely $(s_{i_0 d} \wedge c_{i_0 e}) \Rightarrow c_{de}$. This reads "if d
reads from $i_0$ and $i_0$ is ordered before e (in ws, because $i_0$ and e are two writes
to x), then d is ordered before e (in fr)."

Together with (ppo P0) and (ppo P1), these constraints represent the
execution in Fig. 2. We cannot find a satisfying assignment of these constraints, as
this leads to both a before b (by (ppo P0)) and b before a (by (fr y), (rf-grf y),
(ppo P1), (fr x) and (rf-grf x)). On Power, however, we neither have the ppo nor
the grf constraints, hence we can find a satisfying assignment.
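
Spelled out on the clock variables, the contradiction on SC is a chain of strict inequalities that closes on itself. The following is a sketch that assumes one clock per event, written $\mathsf{clock}_a, \ldots, \mathsf{clock}_f$, with each constraint name placed under the inequality it contributes:

```latex
\[
  \mathsf{clock}_a
  \underbrace{<}_{\text{(ppo P0)}} \mathsf{clock}_b
  \underbrace{<}_{\text{(fr y)}} \mathsf{clock}_f
  \underbrace{<}_{\text{(rf-grf y)}} \mathsf{clock}_c
  \underbrace{<}_{\text{(ppo P1)}} \mathsf{clock}_d
  \underbrace{<}_{\text{(fr x)}} \mathsf{clock}_e
  \underbrace{<}_{\text{(rf-grf x)}} \mathsf{clock}_a
\]
% No assignment of naturals satisfies clock_a < clock_a. Dropping the ppo
% and rf-grf links, as on Power, breaks the chain.
```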

Symbolic partial orders. We associate each symbolic event x of an ses $\gamma$ with a
unique clock variable $\mathsf{clock}_x$ (cf. [46,61]) ranging over the naturals. For two events
x and y, we define the Boolean clock constraint as
$c_{xy} \triangleq (g(x) \wedge g(y)) \Rightarrow \mathsf{clock}_x < \mathsf{clock}_y$
("<" being less-than over the integers). We encode a relation r over the symbolic
events of $\gamma$ as the formula $\phi(r)$ defined as the conjunction of the clock constraints
$c_{xy}$ for all $(x, y) \in r$, i.e., $\phi(r) \triangleq \bigwedge_{(x,y) \in r} c_{xy}$.

Let C be a valuation of the clocks of the events of $\gamma$. Let V be a valuation
of the symbols of the formula ssa associated to $\gamma$. As noted in Sec. [4], V gives us
concrete values for the events of $\gamma$, and allows us to evaluate their guards. We
show below that (C, V) satisfies $\phi(r)$ iff the concretisation of r w.r.t. V is acyclic,
provided that this relation has finite prefixes.

A prefix of x in a relation r is a (possibly infinite) list $S = [x_0, x_1, x_2, \ldots]$
s.t. $x = x_0$ and for all i, $(x_{i+1}, x_i) \in r$ (observe that the prefix is reversed
w.r.t. the order imposed by the relation). The relation r has finite prefixes if
for each x, there is a bound $l \in \mathbb{N}$ to the cardinality of the prefixes of x in r.
We write card(S) for the cardinality of a list $S = [x_0, x_1, x_2, \ldots]$, i.e., $\mathsf{card}(S) =
\mathsf{card}(\{x \mid \exists i.\, x = x_i\})$. We write pref(r, x) for the set of prefixes of x in r. Formally,
r has finite prefixes when $\forall x.\, \exists l.\, \forall S \in \mathsf{pref}(r, x).\ \mathsf{card}(S) \le l$. In our proofs and in
Alg. [4] we denote the concatenation of two lists $S_1$ and $S_2$ by $S_1$++$S_2$.

In the following, we allow symbolic relations with infinite prefixes provided 
their concretisations have finite prefixes. Thus we do not consider executions 
with an infinite past, or running for more steps than the cardinality of N. Our 
first lemma justifies why checking the acyclicity of a concrete relation amounts 
to checking the satisfiability of the formula encoding this relation symbolically: 

Lemma 1. (C, V) satisfies $\phi(r)$ iff conc(r, V) is acyclic and has finite prefixes.

Proof. $\Rightarrow$: We let $r_c = \mathsf{conc}(r, V)$. One can show by induction that (*) if (C, V)
satisfies $\phi(r)$ then for all $(x,y) \in r_c^+$, $c_{xy}$ is true. Now, suppose $\phi(r)$ satisfied,
and as a contradiction, $r_c$ cyclic, i.e., $\exists x.\, (x,x) \in r_c^+$. Thus $c_{xx}$ is true by (*);
this contradicts the irreflexivity of < over the integers.

Now we show that $r_c$ has finite prefixes, i.e., for each x we give a bound l over
all $S \in \mathsf{pref}(r_c, x)$. As a contradiction take $S = [x_0, \ldots, x_n] \in \mathsf{pref}(r_c, x)$ s.t. $x =
x_0$ and $\mathsf{card}(S) > \mathsf{clock}_x$. Thus for all i, we have $(x_{i+1}, x_i) \in r_c$ and $\mathsf{clock}_{x_{i+1}} <
\mathsf{clock}_{x_i}$ by (*). Since $n > \mathsf{card}(S)$, $\mathsf{card}(S) > \mathsf{clock}_x$ and $\mathsf{clock}_{x_0} = \mathsf{clock}_x$, we
have $\mathsf{clock}_{x_n} < 0$, which contradicts the fact that our clocks are naturals. Thus
for each x we can take $l = \mathsf{clock}_x$.

$\Leftarrow$: Let $r_c = \mathsf{conc}(r, V)$. For all e s.t. g(e) = false, take $\mathsf{clock}_e = 0$. Thus $c_{xy}$
is true if g(x) or g(y) is false. Now, have $(x,y) \in r$ with both guards true, i.e.,
$(x,y) \in r_c$. Take $\mathsf{clock}_x$ to be the maximal cardinality of the S in $\mathsf{pref}(r_c, x)$,
idem for y. We want to prove $\mathsf{clock}_x < \mathsf{clock}_y$. Take S s.t. $\mathsf{clock}_x = \mathsf{card}(S)$.
From $(x,y) \in r_c$, we have $[y]$++$S \in \mathsf{pref}(r_c, y)$. Now, $\mathsf{card}([y]$++$S) \le \mathsf{clock}_y$
by maximality of $\mathsf{clock}_y$. It suffices to prove $\mathsf{card}(S) < \mathsf{card}([y]$++$S)$. Suppose
$\mathsf{card}(S) \ge \mathsf{card}([y]$++$S)$. Then y appears in S. Thus $(y,x) \in r_c^+$ since S is a
prefix of x; as $(x,y) \in r_c$ by hypothesis, we have a cycle in $r_c$.
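The clock assignment used in the second half of the proof can be illustrated on a finite concrete relation. The following Python sketch is ours and is not part of the tool: the function name `clocks_from_relation` is hypothetical, and all guards are assumed to be true. It gives each event the length of its longest chain of predecessors (the "maximal prefix" of the proof) and reports a cycle when no such assignment exists.

```python
def clocks_from_relation(events, rel):
    """Return {event: clock} with clock[x] < clock[y] for every (x, y) in rel,
    or None if rel is cyclic. Assumption: every guard is true."""
    preds = {e: [] for e in events}
    for x, y in rel:
        preds[y].append(x)
    clock, on_stack = {}, set()

    def visit(e):
        if e in clock:
            return True
        if e in on_stack:          # e reached again while being explored: a cycle
            return False
        on_stack.add(e)
        best = 0
        for p in preds[e]:
            if not visit(p):
                return False
            best = max(best, clock[p])
        on_stack.discard(e)
        clock[e] = best + 1        # the prefix [e] alone has cardinality 1
        return True

    for e in events:
        if not visit(e):
            return None
    return clock
```

On an acyclic relation the returned clocks satisfy every clock constraint; on a cyclic one, no valuation can, which matches the statement of Lem. [1] for finite relations.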

The formula $\phi(r_1 \cup r_2)$ is equivalent to $\phi(r_1) \wedge \phi(r_2)$. Thus we encode unions
of relations, e.g., $\mathsf{ghb}_A = \mathsf{ws} \cup \mathsf{fr} \cup \mathsf{grf}_A \cup \mathsf{ppo}_A \cup \mathsf{ab}_A$, as the conjunction of their
respective encodings. By Lem. [1], the acyclicity of $\mathsf{ghb}_A$ corresponds to the
satisfiability of $\phi(\mathsf{ghb}_s)$, where $\mathsf{ghb}_s$ is a symbolic encoding of $\mathsf{ghb}_A$. To form $\phi(\mathsf{ghb}_s)$,
we form the conjunction of the formulas $\phi(r)$, for r being a symbolic encoding of
ws, fr, $\mathsf{grf}_A$, $\mathsf{ppo}_A$ and $\mathsf{ab}_A$.

We now present these encodings, in that order. Sec. [3] also relies on the 
program order per location for the uniproc check, and the dependencies for 
the thin check, omitted for brevity. We compute them alongside the preserved 
program order; they use independent sets of clock variables, but the same clock 
constraints. 

We define auxiliaries over symbolic events: tid(e) is the thread identifier of e,
addr(e) the memory address read from or written to (e.g., x for (e) Rxy), and
val(e) its (symbolic) value. Each algorithm outputs constraints, whose conjunction
we add to pord.

input: $\gamma$, A    output: $C_{\mathsf{wf}}$, $C_{\mathsf{rf}}$, $C_{\mathsf{grf}}$

 1  reads := $\{(a, \{r_1 \ldots r_n\}) \mid r_i \text{ is read} \wedge \mathsf{addr}(r_i) = a\}$
 2  writes := $\{(a, \{w_1 \ldots w_n\}) \mid w_i \text{ is write} \wedge \mathsf{addr}(w_i) = a\}$
 3  $C_{\mathsf{rf}} := \emptyset$; $C_{\mathsf{grf}} := \emptyset$; $C_{\mathsf{wf}} := \emptyset$
 4  foreach a s.t. $\exists R, W.\, (a, R) \in$ reads $\wedge\ (a, W) \in$ writes do
 5    foreach $r \in R$ do
 6      rf_some := $\emptyset$
 7      foreach $w \in W$ do
 8        if $(r, w) \notin \mathsf{po}(\gamma)$ then
 9          rf_some := rf_some $\cup\ \{s_{wr}\}$
10          $C_{\mathsf{wf}} := C_{\mathsf{wf}} \cup \{s_{wr} \Rightarrow (g(w) \wedge \mathsf{val}(r) = \mathsf{val}(w))\}$
11          $C_{\mathsf{rf}} := C_{\mathsf{rf}} \cup \{s_{wr} \Rightarrow c_{wr}\}$
12          if (w, r) not relaxed on A and $\mathsf{tid}(w) \neq \mathsf{tid}(r)$ then
13            $C_{\mathsf{grf}} := C_{\mathsf{grf}} \cup \{s_{wr} \Rightarrow c_{wr}\}$
14      $C_{\mathsf{wf}} := C_{\mathsf{wf}} \cup \{g(r) \Rightarrow \bigvee_{s \in \text{rf\_some}} s\}$

Algorithm 1: Constraints for read-from



For each algorithm we state and prove a lemma about its correctness. These
follow the scheme of Lem. [1], i.e., we show the encoding correct for any satisfying
valuation of clocks and ssa. Thus we will introduce symbolic encodings of sets
$r(\gamma)$, where membership in the set is given by a formula and thus depends on
the actual valuation under C and V.

5.1 Read-from

For an architecture A and an ses $\gamma$, Alg. [1] encodes the read-from (resp. safe
read-from) as the set of constraints $C_{\mathsf{rf}}$ (resp. $C_{\mathsf{grf}}$). Following Sec. [3], we add
constraints to $C_{\mathsf{grf}}$ depending on: first, the relation being within one thread or
between distinct threads (derivable from tid(w) and tid(r)); second, whether A
exhibits store buffering, store atomicity relaxation, or both.

Alg. [1] groups the reads and writes by address, in the sets reads and writes
(lines [1] and [2]). For iriw, reads = $\{(x, \{a, d\}), (y, \{b, c\})\}$ and writes = $\{(x, \{i_0, e\}), (y, \{i_1, f\})\}$.

The next step forms the potential read-from pairs. To that end, Alg. [1] introduces
a free Boolean variable $s_{wr}$ for each pair (w, r) of write and read to
the same address (line [9]), unless such a pair contradicts program order (line [8]).
Indeed, if (w, r) is in rf and (r, w) is in po, this violates the uniproc check of
Sec. [3].

The variable rf_some, initialised in line [6], collects the variables $s_{wr}$ in line [9].
For iriw, the memory address x, and the read a, we have rf_some = $\{s_{i_0 a}, s_{ea}\}$,
i.e., the read a can read either from $i_0$ (the initial write to x), or from the write
e on $P_2$.

Following Sec. [3], each read must read from some write. We ensure this at
line [14] by gathering in $C_{\mathsf{wf}}$, for a given r, the disjunction of all the potential
read-from variables $s_{wr}$ collected in rf_some.

Going back to iriw, recall from Sec. [4] that an event has a guard indicating
the branch of the program it comes from. In iriw, the guard of a is true (as are all
the others), i.e., the read a is concretely executed. Hence there exists a write
(either $i_0$ or e) from which a reads, as expressed by the constraint $s_{i_0 a} \vee s_{ea}$
formed at line [14].

If $s_{wr}$ evaluates to true (i.e., r reads from w), we record the value constraint
val(r) = val(w) in the set $C_{\mathsf{wf}}$ (line [10]). For iriw, we obtain the following for x:
$(s_{i_0 a} \Rightarrow x_1 = x_0) \wedge (s_{i_0 d} \Rightarrow x_2 = x_0) \wedge (s_{ea} \Rightarrow x_1 = x_3) \wedge (s_{ed} \Rightarrow x_2 = x_3)$.
The constraint $s_{i_0 a} \Rightarrow x_1 = x_0$ reads "if $s_{i_0 a}$ is true (i.e., a reads from $i_0$) then
the value $x_1$ read by a equals the value $x_0$ written by $i_0$."

The constraint added to $C_{\mathsf{rf}}$ is such that the clock constraint $c_{wr}$ is enforced
only if $s_{wr}$ evaluates to true (line [11]). For iriw we add the following to $C_{\mathsf{rf}}$,
for the address x: $(s_{i_0 a} \Rightarrow c_{i_0 a}) \wedge (s_{ea} \Rightarrow c_{ea}) \wedge (s_{i_0 d} \Rightarrow c_{i_0 d}) \wedge (s_{ed} \Rightarrow c_{ed})$.

If (w, r) is not relaxed on A, we also add its clock constraint $c_{wr}$ to $C_{\mathsf{grf}}$
(line [13]). In iriw, all reads read from an external thread. Thus on an architecture
that does not relax store atomicity (i.e., stronger than Power), we add the
constraints that we added to $C_{\mathsf{rf}}$ to $C_{\mathsf{grf}}$ as well. On Power, $C_{\mathsf{grf}}$ remains empty.
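To make the steps of Alg. [1] concrete, the following Python sketch generates the three constraint sets as plain strings for the events of iriw on address x. It is only an illustration under our own assumptions: the `Event` record, the thread labels and the `relaxed` callback are hypothetical, and, unlike Alg. [1], the sketch also emits a constraint for a read whose address has no write at all.

```python
from collections import namedtuple

# Field names are illustrative, not taken from the paper's implementation.
Event = namedtuple("Event", "name kind addr tid val")

def read_from_constraints(events, po, relaxed):
    """Sketch of Alg. 1; constraints are returned as plain strings.

    po is a set of (name, name) pairs in program order; relaxed(w, r) tells
    whether the architecture relaxes the read-from pair (w, r)."""
    c_wf, c_rf, c_grf = [], [], []
    for r in (e for e in events if e.kind == "R"):
        rf_some = []
        for w in (e for e in events if e.kind == "W" and e.addr == r.addr):
            if (r.name, w.name) in po:          # line 8: pair contradicts po
                continue
            s = f"s_{w.name}{r.name}"
            rf_some.append(s)                                          # line 9
            c_wf.append(f"{s} => (g({w.name}) & {r.val} = {w.val})")   # line 10
            c_rf.append(f"{s} => c_{w.name}{r.name}")                  # line 11
            if not relaxed(w, r) and w.tid != r.tid:                   # line 12
                c_grf.append(f"{s} => c_{w.name}{r.name}")             # line 13
        c_wf.append(f"g({r.name}) => ({' | '.join(rf_some) or 'false'})")  # line 14
    return c_wf, c_rf, c_grf

# iriw restricted to address x; thread names are hypothetical labels.
iriw_x = [Event("i0", "W", "x", "main", "x0"), Event("e", "W", "x", "P2", "x3"),
          Event("a", "R", "x", "P0", "x1"), Event("d", "R", "x", "P1", "x2")]
wf, rf, grf_sc = read_from_constraints(iriw_x, set(), lambda w, r: False)
_, _, grf_power = read_from_constraints(iriw_x, set(), lambda w, r: True)
```

With no pair relaxed, the safe read-from constraints coincide with the read-from ones; when every pair is relaxed, as for the external pairs of iriw on Power, `grf_power` stays empty.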

We now write $\mathsf{grf}_A$ for both the function over concrete relations given by the
definition of A as in Sec. [3] and the corresponding function over symbolic relations.
Given an architecture A, we have $\mathsf{grf}_A(r) = \{(w, r) \in r \mid (w, r) \text{ is not relaxed on } A\}$.
For example if A is TSO, all thread-local read-from pairs are relaxed: $\mathsf{grf}_A(r) =
\{(w,r) \in r \mid \mathsf{tid}(w) \neq \mathsf{tid}(r)\}$. We write $(w,r) \in \mathsf{WR}_a$ when w writes to an address
a and r reads from the same a, and $\mathsf{prf}(\gamma) = \{(w, r) \in \bigcup_a \mathsf{WR}_a \mid (r, w) \notin \mathsf{po}(\gamma)\}$.
We write $\mathsf{rf}(\gamma)$ for the set $\{(w, r) \in \mathsf{prf}(\gamma) \mid s_{wr}\}$ (with $s_{wr}$ of Alg. [1]), and $\mathsf{grf}(\gamma)$
for $\mathsf{grf}_A(\mathsf{rf}(\gamma))$. Note that we build the external safe read-from ($\mathsf{grfe}(\gamma)$) only, i.e.,
between two events from distinct threads. We compute the internal one as part
of $\mathsf{ppo}_A$, in Alg. [4].

Given an ses $\gamma$, Alg. [1] outputs $C_{\mathsf{rf}}$, $C_{\mathsf{grf}}$ and $C_{\mathsf{wf}}$. Let WR be a valuation of
the $s_{wr}$ variables of $\gamma$. We write $\mathsf{inst}(\mathsf{rf}(\gamma), \mathsf{WR})$ (resp. $\mathsf{inst}(\mathsf{grf}(\gamma), \mathsf{WR})$) for $\mathsf{rf}(\gamma)$
(resp. $\mathsf{grfe}(\gamma)$) where WR instantiates the $s_{wr}$ variables (thus $\mathsf{rf}(\gamma)$ is a symbolic
encoding of the set as noted before this sub-section; we use this notation similarly
in the remainder of this section). We show that Alg. [1] gives the clock constraints
encoding grf (we omit the corresponding lemma for rf):

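Lemma 2. (C, V, WR) satisfies $\bigwedge_{c \in C_{\mathsf{wf}} \cup C_{\mathsf{grf}}} c$ iff (C, V) satisfies

i) for all r s.t. g(r) is true, there is w s.t. $(w,r) \in \mathsf{inst}(\mathsf{rf}(\gamma), \mathsf{WR})$ and

ii) for all $(w,r) \in \mathsf{inst}(\mathsf{rf}(\gamma), \mathsf{WR})$, g(w) is true and val(w) = val(r) and

iii) $\bigwedge_{(w,r) \in \mathsf{inst}(\mathsf{grfe}(\gamma), \mathsf{WR})} c_{wr}$.

Proof. An induction on R, W s.t. $(a,R) \in$ reads and $(a,W) \in$ writes for some
address a, then union for all a shows that $C_{\mathsf{grf}} = \{s_{wr} \Rightarrow c_{wr} \mid (w,r) \in \mathsf{prf}(\gamma) \cap
\mathsf{grfe}_A\}$, and $C_{\mathsf{wf}} = \bigcup_{r \text{ is read}} \{g(r) \Rightarrow \bigvee_{(w,r) \in \mathsf{prf}(\gamma)} s_{wr}\} \cup \{s_{wr} \Rightarrow (g(w) \wedge \mathsf{val}(r) =
\mathsf{val}(w)) \mid (w,r) \in \mathsf{prf}(\gamma)\}$. The first component of $C_{\mathsf{wf}}$ is equivalent to i); the
second to ii). $C_{\mathsf{grf}}$ is equivalent to iii).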

The model described in Sec. [3] suggests that rf must be encoded to be exclusive,
i.e., to link a read to only one write. An explicit encoding thereof, however,

input: $\gamma$    output: $C_{\mathsf{ws}}$

1  writes := $\{(a, \{w_1 \ldots w_n\}) \mid w_i \text{ is write} \wedge \mathsf{addr}(w_i) = a\}$
2  $C_{\mathsf{ws}} := \emptyset$; foreach a s.t. $\exists W.\, (a, W) \in$ writes do
3    foreach $w \in W$ do
4      foreach $w' \in W$ s.t. $\mathsf{tid}(w') \neq \mathsf{tid}(w)$ do
5        $C_{\mathsf{ws}} := C_{\mathsf{ws}} \cup \{\neg c_{ww'} \Rightarrow c_{w'w}\}$

Algorithm 2: Constraints for write serialisation



would be redundant, as this is already enforced by ws and fr. Hence it suffices
to consider at least one write per read, as Alg. [1] does:

Lemma 3. $\mathsf{uniproc}(E, X) \Rightarrow \forall r.\, \neg(\exists w \neq w'.\, (w,r) \in \mathsf{rf} \wedge (w',r) \in \mathsf{rf})$

Proof. By contradiction, have $w \neq w'$ s.t. $(w,r) \in \mathsf{rf}$ and $(w',r) \in \mathsf{rf}$. By totality
of ws, $(w,w') \in \mathsf{ws}$ or $(w',w) \in \mathsf{ws}$. W.l.o.g. have $(w,w') \in \mathsf{ws}$. Then $(r, w') \in \mathsf{fr}$,
i.e., a cycle in $\mathsf{rf} \cup \mathsf{fr}$: w', r, w', forbidden by uniproc.

5.2 Write serialisation 

Given an ses $\gamma$, Alg. [2] encodes the write serialisation ws as the set of constraints
$C_{\mathsf{ws}}$. By definition, ws is a total order over writes to a given address. Alg. [2]
implements the totality by ensuring that for two writes $w \neq w'$ to the same
address either $c_{ww'}$ or $c_{w'w}$ holds. For implementation reasons we choose to
express this as $\neg c_{ww'} \Rightarrow c_{w'w}$ rather than $c_{ww'} \vee c_{w'w}$.

Alg. [2] groups the writes per address. For each address a and write w to
a (lines [2] and [3]) we choose another write w' to a (line [4]), and build the
disjunction of clock constraints over w and w' (line [5]). For iriw we have writes =
$\{(x, \{i_0, e\}), (y, \{i_1, f\})\}$, and the constraints: $(\neg c_{i_0 e} \Rightarrow c_{e i_0}) \wedge (\neg c_{i_1 f} \Rightarrow c_{f i_1})$.
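The write serialisation constraints for iriw can be reproduced with a few lines of Python. This is a sketch under our own assumptions, not the tool's code: writes are hypothetical (name, address, thread) triples, and we emit one implication per unordered pair of writes, since $\neg c_{ww'} \Rightarrow c_{w'w}$ and $\neg c_{w'w} \Rightarrow c_{ww'}$ are logically equivalent.

```python
from itertools import combinations

def ws_constraints(writes):
    """Sketch of Alg. 2. writes: list of (name, addr, tid) triples.

    Returns one implication '!c_ww2 => c_w2w' per unordered pair of writes
    to the same address issued by distinct threads (external ws only)."""
    out = []
    for (w, addr, tid), (w2, addr2, tid2) in combinations(writes, 2):
        if addr == addr2 and tid != tid2:
            out.append(f"!c_{w}{w2} => c_{w2}{w}")
    return out

# The four writes of iriw; thread names are hypothetical labels.
iriw_writes = [("i0", "x", "main"), ("e", "x", "P2"),
               ("i1", "y", "main"), ("f", "y", "P3")]
iriw_ws = ws_constraints(iriw_writes)
```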

Note that we build the external ws only (wse). With $\mathsf{WW}_a$ the pairs of writes
to the address a, and $\mathsf{ws}(\gamma)$ the set $\{(w, w') \in \bigcup_a \mathsf{WW}_a \mid c_{w'w} = \text{false}\}$, we have
$\mathsf{wse}(\gamma) = \mathsf{ws}(\gamma) \cap \{(w,w') \mid \mathsf{tid}(w) \neq \mathsf{tid}(w')\}$. We compute the thread-local ws
as part of $\mathsf{ppo}_A$, in Alg. [4]. Given an input $\gamma$ of Alg. [2], we now characterise the
clock constraints given by $C_{\mathsf{ws}}$. Basically we show that Alg. [2] gives the clock
constraints encoding ws. The proof (omitted for brevity) is by induction as for
Lem.CU

Lemma 4. (C, V) satisfies $\bigwedge_{c \in C_{\mathsf{ws}}} c$ iff (C, V) satisfies $\bigwedge_{(w,w') \in \mathsf{wse}} c_{ww'}$.

We quantify over all pairs of writes to the same address to build ws. Thus
for $w_0, w_1, w_2$ in ws in a concrete execution, we build $(w_0, w_1)$, $(w_1, w_2)$ and the
redundant $(w_0, w_2)$ in the symbolic world. This is inherent to the totality of ws.

5.3 From-read 

Given an ses $\gamma$, Alg. [3] encodes from-read as the set of constraints $C_{\mathsf{fr}}$. Recall
that $(r,w) \in \mathsf{fr}$ means $\exists w'.\, (w',r) \in \mathsf{rf} \wedge (w',w) \in \mathsf{ws}$. The existential quantifier

input: $\gamma$    output: $C_{\mathsf{fr}}$

1  reads := $\{(a, \{r_1 \ldots r_n\}) \mid r_i \text{ is read} \wedge \mathsf{addr}(r_i) = a\}$
2  writes := $\{(a, \{w_1 \ldots w_n\}) \mid w_i \text{ is write} \wedge \mathsf{addr}(w_i) = a\}$
3  $C_{\mathsf{fr}} := \emptyset$
4  foreach a s.t. $\exists R, W.\, (a, W) \in$ writes, $(a, R) \in$ reads do
5    foreach $(w, w') \in W \times W$ s.t. $w' \neq w$ do
6      foreach $r \in R$ with $\mathsf{tid}(r) \neq \mathsf{tid}(w)$ do
7        $C_{\mathsf{fr}} := C_{\mathsf{fr}} \cup \{(s_{w'r} \wedge c_{w'w} \wedge g(w)) \Rightarrow c_{rw}\}$

Algorithm 3: Constraints for from-read



corresponds to a disjunction: $\bigvee_{w' \text{ is write}} (w',r) \in \mathsf{rf} \wedge (w',w) \in \mathsf{ws}$. Since this
disjunction can be large, which is undesirable in the expression simplification used in
the implementation, we rewrite it as a conjunction of small implications, each of
which is simplified in isolation: $\bigwedge_{w' \text{ is write}} ((r,w) \in \mathsf{fr} \Leftarrow (w',r) \in \mathsf{rf} \wedge (w',w) \in \mathsf{ws})$.
Thus Alg. [3] encodes from-read as a conjunction of the premise variables $s_{wr}$ of
$C_{\mathsf{rf}}$ and clock variables $c_{ww'}$ of $C_{\mathsf{ws}}$ introduced in Alg. [1] and [2].

Again, we collect the sets of reads and writes per address. Alg. [3] considers
triples (w', w, r) of events to the same address, where (w', w) is in the write
serialisation, and (w', r) is in read-from. We enumerate the pairs of writes in
line [5] and then pick a read in line [6]. For each such triple we add in line [7] the
clock constraint $c_{rw}$ under the premise that i) $(w',r) \in \mathsf{rf}$, witnessed by $s_{w'r}$,
ii) $(w',w) \in \mathsf{ws}$, witnessed by $c_{w'w}$, and that iii) the write w actually takes place
in a concrete execution, i.e., g(w) evaluates to true.

For iriw all guards are true. For x, we obtain: $((s_{i_0 a} \wedge c_{i_0 e}) \Rightarrow c_{ae}) \wedge ((s_{i_0 d} \wedge
c_{i_0 e}) \Rightarrow c_{de}) \wedge ((s_{ea} \wedge c_{e i_0}) \Rightarrow c_{a i_0}) \wedge ((s_{ed} \wedge c_{e i_0}) \Rightarrow c_{d i_0})$. For example,
$(s_{i_0 a} \wedge c_{i_0 e}) \Rightarrow c_{ae}$ reads "if $s_{i_0 a}$ is true (i.e., if a reads from $i_0$), and if $c_{i_0 e}$ is true (i.e.,
$(i_0, e) \in \mathsf{ws}$) then $c_{ae}$ is true (i.e., a is in fr before e)."
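The triple enumeration of Alg. [3] can be sketched in Python as follows. The representation of events as (name, address, thread) triples and the string form of the constraints are our own assumptions for illustration; guards are kept symbolic as `g(w)`.

```python
def fr_constraints(reads, writes):
    """Sketch of Alg. 3. reads/writes: lists of (name, addr, tid) triples."""
    out = []
    for w, addr, tid_w in writes:
        for w2, addr2, _ in writes:                 # w2 plays the role of w'
            if w2 == w or addr2 != addr:
                continue
            for r, addr_r, tid_r in reads:
                if addr_r == addr and tid_r != tid_w:   # external from-read only
                    out.append(f"(s_{w2}{r} & c_{w2}{w} & g({w})) => c_{r}{w}")
    return out

# iriw restricted to address x; thread names are hypothetical labels.
iriw_fr_x = fr_constraints([("a", "x", "P0"), ("d", "x", "P1")],
                           [("i0", "x", "main"), ("e", "x", "P2")])
```

With all guards true, the four implications produced for x are the ones listed above for iriw.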

Given an ses $\gamma$, Alg. [3] outputs $C_{\mathsf{fr}}$. Note that we compute here the external
from-read only (fre), and the internal one as part of $\mathsf{ppo}_A$, in Alg. [4]. We show
that Alg. [3] gives the clock constraints encoding fr. The (omitted) proof is as for
Lem.H

Lemma 5. (C, V, WR) satisfies $\bigwedge_{c \in C_{\mathsf{fr}}} c$ iff (C, V) satisfies $\bigwedge_{(r,w) \in \mathsf{inst}(\mathsf{fre}(\gamma), \mathsf{WR})} c_{rw}$.

input: $\gamma$, A    output: $C_{\mathsf{ppo}}$

 1  $C_{\mathsf{ppo}} := \emptyset$; foreach $S \in \mathsf{po}(\gamma) \wedge S \neq []$ do
 2    $S = [e]$++$S' \cap \{e \mid e \text{ is not fence}\}$
 3    chains := $[(e, \emptyset)]$; $R$ := true
 4    foreach $e' \in S'$ do
 5      $T' := \emptyset$
 6      foreach $(e'', T'') \in$ chains s.t. there is no r s.t. $(e'', r) \in T'$ and $((g(e') \wedge g(e'') \wedge R) \Rightarrow r)$ do
 7        $r_{e''e'} := \mathsf{not\_relax}_{A,\gamma}(e'', e')$
 8        if $r_{e''e'}$ is satisfiable then
 9          $C_{\mathsf{ppo}} := C_{\mathsf{ppo}} \cup \{r_{e''e'} \Rightarrow c_{e''e'}\}$
10          $T' := T' \cup \{(e'', r_{e''e'})\}$
11          foreach $(e, r) \in T''$ do
12            if $\exists r'.\, (e, r') \in T'$ then
13              $R := R \wedge (\rho \Leftrightarrow r' \vee (r_{e''e'} \wedge r))$
14              $T' := \{(e, \rho)\} \cup T' \setminus (e, r')$
15            else $T' := \{(e, r_{e''e'} \wedge r)\} \cup T'$
16      chains := $[(e', T')]$++[chains]

Algorithm 4: Constraints for preserved program order

Fig. 5. fr derives from rf and ws

The fr defined above, together with ws, does introduce possible redundancies:
given $(w_0, r) \in \mathsf{rf}$ with $(w_0, w_1) \in \mathsf{ws}$ and $(w_1, w_2) \in \mathsf{ws}$, we have both
$(r, w_1) \in \mathsf{fr}$ and $(r, w_2) \in \mathsf{fr}$, but the latter is redundant as the same ordering is
implied by $(r, w_1) \in \mathsf{fr}$ and $(w_1, w_2) \in \mathsf{ws}$. We could thus, instead, build
a fragment of fr, which we write $\mathsf{fr}_0$. We define $\mathsf{fr}_0$
as $\{(r, w_1) \mid \exists w_0.\, (w_0, r) \in \mathsf{rf} \wedge (w_0, w_1) \in \mathsf{ws} \wedge
\nexists w'.\, ((w_0, w') \in \mathsf{ws} \wedge (w', w_1) \in \mathsf{ws})\}$. In Fig. [5], $(r, w_1)$
is in $\mathsf{fr}_0$ but not $(r, w_2)$, because there is a write ($w_1$) in ws between the write
$w_0$ from which r reads and $w_2$. One can show that $(r, w) \in \mathsf{fr}$ if $(r, w) \in \mathsf{fr}_0$ or
there exists w' s.t. $(r, w') \in \mathsf{fr}_0$ and $(w', w) \in \mathsf{ws}$, i.e., we can generate fr from $\mathsf{fr}_0$
and ws.
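The relation between fr and its fragment $\mathsf{fr}_0$ can be checked on concrete relations given as sets of pairs. The Python sketch below is ours, with hypothetical function names; it assumes ws is given as a transitively closed total order, as in a concrete execution.

```python
def from_read(rf, ws):
    """fr = {(r, w) | exists w0. (w0, r) in rf and (w0, w) in ws}."""
    return {(r, w) for (w0, r) in rf for (w0b, w) in ws if w0b == w0}

def from_read_0(rf, ws):
    """fr0: keep (r, w1) only if no write lies between w0 and w1 in ws."""
    writes = {w for pair in ws for w in pair}
    return {(r, w1) for (w0, r) in rf for (w0b, w1) in ws
            if w0b == w0
            and not any((w0, m) in ws and (m, w1) in ws for m in writes)}

def regenerate(fr0, ws):
    """Rebuild fr as fr0 together with the composition of fr0 and ws."""
    return fr0 | {(r, w) for (r, m) in fr0 for (mb, w) in ws if mb == m}

rf = {("w0", "r")}
ws = {("w0", "w1"), ("w1", "w2"), ("w0", "w2")}   # total order, transitively closed
```

On this example, `from_read_0` drops the redundant pair (r, w2), and `regenerate` recovers it from (r, w1) and (w1, w2).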

5.4 Preserved program order 

For an architecture A and an ses $\gamma$, Alg. [4] encodes the preserved program order
as the set $C_{\mathsf{ppo}}$. In Sec. [3], the function $\mathsf{ppo}_A$, which is part of the definition of A,
determines if A relaxes a pair (e, e') in program order in a concrete execution.
For example, RMO and Power relax read-read pairs, but PSO and stronger do
not.

We reuse the notation $\mathsf{ppo}_A$ for the function collecting non-relaxed pairs in
symbolic program order. Unlike in Sec. [3], the non-relaxed pairs in symbolic program
order also include the internal safe read-from, internal write serialisation,
internal from-read, and the orderings due to Power's isync fence. We generate
these constraints here, rather than in Alg. [1] to [3], to limit the redundancies. We
write $\mathsf{ppo}_A(\gamma)$ for $\mathsf{ppo}_A(\text{po-br}(\gamma))$, or only $\mathsf{ppo}_A$ if $\gamma$ is clear from the context.

Alg. [4] avoids building redundant transitive closure constraints, taking into
account the guards of events: for two events $e_1, e_2$, we build a constraint iff
$(e_1, e_2) \in \mathsf{ppo}_A(\gamma)$. If, e.g., $\mathsf{ppo}_A(\text{po-br}) = \text{po-br}$ (on SC), Alg. [4] creates constraints
only for neighbouring events in $\text{po-br}(\gamma)$ in each control flow branch of
the program.

As SSA and loop unrolling yield $\mathsf{po}(\gamma)$ (i.e., lists of symbolic events per
thread) rather than $\text{po-br}(\gamma)$ (the corresponding DAG), we cannot construct
$C_{\mathsf{ppo}}$ by analysing control flow branches of the program. Building $C_{\mathsf{ppo}}$ from
$\mathsf{po}(\gamma)$ requires some more work.

To build $\mathsf{ppo}_A$, Alg. [4] uses the variable chains, a list of pairs (y, T). For a
given y, its companion set T contains the events x occurring before y in $\mathsf{ppo}_A^+$
together with a formula r that characterises all paths of $\mathsf{ppo}_A^+$ between x and
y. We build r from formulas $r_{e''e'}$ asserting that $(e'', e') \in \mathsf{ppo}_A$, describing
individual steps (e'', e') of a path between x and y.

We compute the formula $r_{e''e'}$ at line [7] using the function not_relax. Given an
ses $\gamma$ and a pair (e'', e'), $\mathsf{not\_relax}_{A,\gamma}(e'', e')$ returns a formula $r_{e''e'}$ expressing
the condition under which (e'', e') is not relaxed. For PSO or stronger models,
not_relax only needs to take the direction of the events and their addresses into
account. For instance, TSO relaxes write-read pairs, but nothing else. If a pair
is necessarily relaxed, not_relax returns false, otherwise $\mathsf{not\_relax}_{A,\gamma}(e'', e') =
g(e'') \wedge g(e')$. For models weaker than PSO, such as Alpha, RMO or Power,
not_relax has to determine data and control dependencies, and handle Power's
isync fence. We resolve data dependencies via a definition-use data flow analysis
[5] on the program part in program order between the two events. Control
dependencies use the data dependency analysis to test whether there exists a
branching instruction in program order between the events such that the branching
decision is in data dependency with the first event. For isync, the approach
is similar, except that in addition there must be an isync in program order
between the branch and the second event. We then add the guard of the fence
to the conjunction returned by not_relax.

For a given e', we initialise its companion set T' at line [5], then increment it
in lines [10] to [15]. In line [14], we use fresh variables $\rho$ constrained in the formula R
(line [13]) to avoid repeating sub-formulas, as is standard in, e.g., CNF encodings
[16]. In line [7] we compute the condition $r_{e''e'}$ for (e'', e') not being relaxed
on A for each e'' in chains (unless skipped for transitivity, see below). We generate
the constraint $r_{e''e'} \Rightarrow c_{e''e'}$ iff $r_{e''e'}$ is satisfiable (line [9]), i.e., (e'', e') is not
relaxed on A.

Now, suppose $e_1, e_2, e_3$ on the same thread all in $\mathsf{ppo}_A$; the companion set of
$e_2$ is $\{(e_1, r_{e_1 e_2})\}$, because $(e_1, e_2) \in \mathsf{ppo}_A$ and there is no other event before $e_1$
on the thread. Suppose that Alg. [4] has already built the beginning of the chain
formed by $e_1$, $e_2$ and $e_3$, so that chains = $[(e_2, \{(e_1, r_{e_1 e_2})\}), (e_1, \emptyset)]$ (observe that
the chains are in reverse order of po). At line [4], for each remaining e' on a given
thread, i.e., $e_3$ in our example, Alg. [4] follows lines [5] to [9] and adds a constraint
w.r.t. the immediate predecessor $e_2$ of $e_3$ in $\mathsf{ppo}_A$. The subsequent elements of
chains ($e_1$ in our example) are also candidates for a clock constraint.

We do not add any constraint if $(e_1, e_3)$ is guaranteed to be in $\mathsf{ppo}_A^+$, as
follows. Any remaining element of chains that belongs to the companion set T''
of e'' is added to T' at lines [11] to [15]. As an instance, recall that $e_1$ is in the
companion set of $e_2$. Thus, after generating the constraint $c_{e_2 e_3}$ at line [9], we
add $e_1$ with its transitivity condition $r_{e_2 e_3} \wedge r_{e_1 e_2}$ to T' at line [15]. Then, line [6]
iterates over the rest of chains, i.e., $(e_1, \emptyset)$. With the updated set T', the test
$(e_1, r_{e_1}) \in T'$ yields $r_{e_1} = r_{e_2 e_3} \wedge r_{e_1 e_2}$, and thus amounts to checking the validity
of $(g(e_3) \wedge g(e_1)) \Rightarrow (r_{e_2 e_3} \wedge r_{e_1 e_2})$. Remember that, unless there is an isync,
the conditions $r_{xy}$ amount to conjunctions over guards, hence in our example we
are checking the validity of $(g(e_3) \wedge g(e_1)) \Rightarrow g(e_3) \wedge g(e_2) \wedge g(e_1)$. If all three
events $e_1$, $e_2$ and $e_3$ are on the same control flow branch, the implication is valid
because all guards are equal. This makes the test of line [6] fail and $(e_1, e_3)$ will
not be considered for adding another constraint $c_{e_1 e_3}$, which would have been
redundant. When the implication is not valid, the test of line [6] succeeds; then
we add another constraint $c_{e_1 e_3}$, as this is not redundant here.

This elaboration on guards is essential as witnessed by the following variant
of our example: assume, in contrast to the above, that $e_2$ is not a dominator of
$e_3$ on the control flow graph. This might occur in a program fragment (if $e_1$
then $e_2$); $e_3$, where the guard of $e_2$ would be different from that of $e_1$ or $e_3$. If
we were to skip $(e_1, e_3)$ as above, the constraints would be insufficient to enforce
the order of $e_1$ before $e_3$ when $g(e_2)$ evaluates to false. In this case, the premises
$g(e_1) \wedge g(e_2)$ and $g(e_2) \wedge g(e_3)$ of $c_{e_1 e_2}$ and $c_{e_2 e_3}$, respectively, are false, hence
the clock constraints $\mathsf{clock}_{e_1} < \mathsf{clock}_{e_2}$ and $\mathsf{clock}_{e_2} < \mathsf{clock}_{e_3}$ are not enforced,
leaving the order of $(e_1, e_3)$ unconstrained.
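The role of the guards in this variant can be checked by evaluating the clock constraints directly. The small Python sketch below is ours: it fixes a valuation in which the branch containing $e_2$ is not taken and $e_3$ is clocked before $e_1$, and shows that the two chained constraints alone accept this valuation, whereas the direct constraint $c_{e_1 e_3}$ rejects it. The concrete clock values are arbitrary choices for illustration.

```python
def c(g, clock, x, y):
    """Clock constraint c_xy = (g(x) and g(y)) => clock_x < clock_y."""
    return not (g[x] and g[y]) or clock[x] < clock[y]

g = {"e1": True, "e2": False, "e3": True}    # the branch containing e2 is not taken
clock = {"e1": 2, "e2": 0, "e3": 1}          # a valuation placing e3 before e1

# Only the chained constraints c_e1e2 and c_e2e3: both hold vacuously.
chain_only = c(g, clock, "e1", "e2") and c(g, clock, "e2", "e3")
# Adding the direct constraint c_e1e3 rules this valuation out.
with_direct = chain_only and c(g, clock, "e1", "e3")
```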

We illustrate Alg. [4] on the ses $\gamma$ of iriw (cf. Fig. [3]). Alg. [4] proceeds over
$\mathsf{po}(\gamma)$, equal to $\{[i_0, i_1], [a, b], [c, d], [e], [f]\}$ for iriw. Given a non-empty list S
of $\mathsf{po}(\gamma)$, e.g., S = [a, b] corresponding to $P_0$, the first non-fence event a of S
initialises at line [3] the variable chains (explained below in detail). The loop at
line [4] proceeds with the tail S' of the list S. Thus for $P_0$ at this point we have
chains = $[(a, \emptyset)]$ and Alg. [4] proceeds with S' = [b].

The contents of chains depend on the architecture A, as iriw shows. For $P_0$,
recall that chains = $[(a, \emptyset)]$ and only b remains in S'. If A relaxes read-read
pairs, e.g., RMO or weaker, then (a, b) is relaxed. Thus we do not add any clock
constraint to $C_{\mathsf{ppo}}$ at line [9] and eventually chains = $[(b, \emptyset), (a, \emptyset)]$ in line [16].
If A does not relax read-read pairs, e.g., PSO or stronger, we add $c_{ab}$ to $C_{\mathsf{ppo}}$
at line [9] and add a with the guard conjunction true to T' at line [10]. Thus
chains = $[(b, \{(a, \text{true})\}), (a, \emptyset)]$. Let us now characterise the output of Alg. [4]
given an input ses $\gamma$:

Lemma 6. Alg. [4] outputs $\{r_{xy} \Rightarrow c_{xy} \mid (x, y) \in \mathsf{ppo}_A\}$.

Proof. We write $L_1$ (resp. $L_2$) for the loop from line [4] to [16] (resp. [6] to [15]). $L_1$
maintains the invariant that $S = \mathsf{rd}(\text{chains})$++$S'$, where rd reverses its argument
and deletes T for each element (e, T) of its argument. We write $\mathsf{path}^e_{x,y}(e_1, \ldots, e_n)$
when there is a path from x to y in $\mathsf{ppo}_A(\gamma)$ passing by e, i.e., $e_1 = x$ and
$e_n = y$ and $\forall i.\, (e_i, e_{i+1}) \in \mathsf{ppo}_A(\gamma)$ and $\exists i.\, e_i = e$. $L_2$ maintains the invariant
that $T' = \bigcup_{e \in [e'', e']} T_e$, where $e \in [e'', e']$ means $(e'', e) \in \mathsf{po} \wedge (e, e') \in \mathsf{po}$, and
$T_e = \{(x, r_x) \mid r_x = \bigvee_{\mathsf{path}^e_{x,e'}(e_1, \ldots, e_n)} \bigwedge_{1 \le i < n} r_{e_i e_{i+1}}\}$. We conclude by double
inclusion of $C_{\mathsf{ppo}}$ and $\{r_{xy} \Rightarrow c_{xy} \mid (x,y) \in \mathsf{ppo}_A\}$, omitted for brevity.

Since the $r_{xy}$ are guard conditions, we just need to evaluate the guards to
evaluate them. We show that Alg. [4] gives the clock constraints encoding ppo;
the proof is immediate by Lem. [6].

Lemma 7. (C, V) satisfies $\bigwedge_{c \in C_{\mathsf{ppo}}} c$ iff it satisfies $\bigwedge_{(x,y) \in \mathsf{ppo}_A} c_{xy}$.

input: $\gamma$, A    output: $C_{\mathsf{ab}'}$

 1  $C_{\mathsf{ab}'} := \emptyset$; foreach $S \in \mathsf{po}(\gamma) \wedge S \neq []$ do
 2    fences := $\{s \mid s \in S \wedge s \text{ is fence}\}$
 3    foreach $e \in S \setminus$ fences do
 4      foreach $s \in$ fences do
 5        if $(e, s) \in \mathsf{po}(\gamma)$ then
 6          $C_{\mathsf{ab}'} := C_{\mathsf{ab}'} \cup \{g(s) \Rightarrow c_{es}\}$
 7          if A is not store atomic then
 8            foreach (w, e) being a w-r pair s.t. $\mathsf{addr}(w) = \mathsf{addr}(e)$ and $\mathsf{tid}(w) \neq \mathsf{tid}(e)$ do
 9              $C_{\mathsf{ab}'} := C_{\mathsf{ab}'} \cup \{(g(s) \wedge s_{we}) \Rightarrow c_{ws}\}$
10        else $C_{\mathsf{ab}'} := C_{\mathsf{ab}'} \cup \{g(s) \Rightarrow c_{se}\}$
11          if A is not store atomic then
12            foreach (e, r) being a w-r pair s.t. $\mathsf{addr}(e) = \mathsf{addr}(r)$ and $\mathsf{tid}(e) \neq \mathsf{tid}(r)$ do
13              $C_{\mathsf{ab}'} := C_{\mathsf{ab}'} \cup \{(g(s) \wedge s_{er}) \Rightarrow c_{sr}\}$

Algorithm 5: Constraints for memory fences



5.5 Memory fences and cumulativity 

Given an architecture A and an ses $\gamma$, Alg. [5] encodes the fence orderings as the
set $C_{\mathsf{ab}'}$. A fence s potentially induces orderings over all (e, e') s.t. e is in po
before s and e' after s, which is quadratic in the number of events in po for
each fence. Cumulativity constraints depend on the read-from to appear in the
concrete event structure, and again these are paired with all events before or
after (in po) a fence. We alleviate this with the fence events (see below). The
implementation supports x86's mfence and Power's sync, lwsync and isync.
We handle isync as part of ppo in Alg. [4]. We first present x86's mfence and
Power's sync, then lwsync.

Fences mfence and sync. Alg. [5] applies its procedure to $\mathsf{po}(\gamma)$ (line [1]). For example,
assume sync fences between the read-read pairs of $P_0$ and $P_1$ of iriw, associated
with the fence events $s_0$ and $s_1$. We then have $\mathsf{po}(\gamma) = \{[i_0, i_1], [a, s_0, b], [c, s_1, d], [e], [f]\}$.

For each list S of $\mathsf{po}(\gamma)$ (i.e., per thread), we compute at line [2] the set fences,
containing the fence events of S. For iriw, fences is empty for $P_2$ and $P_3$. For $P_0$,
we have fences = $\{s_0\}$, and $\{s_1\}$ for $P_1$. We test at line [5] for each pair (e, s) s.t.
e is a non-fence event and s is a fence whether (e, s) is in program order, or rather
(s, e). We then build the according non-cumulative constraints, and constraints
for A-cumulativity (for (e, s) in program order) or B-cumulativity (otherwise).

For non-cumulativity, if e is before (resp. after) s in program order, Alg. [5]
produces at line [6] the clock constraint $c_{es}$ (resp. $c_{se}$ at line [10]). In iriw, all
guards are true, hence we generate $c_{a s_0}$ (resp. $c_{c s_1}$) for the event a (resp. c) in po
before the fence $s_0$ (resp. $s_1$) on $P_0$ (resp. $P_1$). Line [10] generates $c_{s_0 b}$ (resp. $c_{s_1 d}$)
for b (resp. d), in po after the fence $s_0$ (resp. $s_1$) on $P_0$ (resp. $P_1$).

If A relaxes store atomicity, we build cumulativity constraints. For A-cumulativity,
Alg. [5] adds at line [9] the constraint $(g(s) \wedge s_{we}) \Rightarrow c_{ws}$, for each (w, e) s.t. e is in po before
the fence s, and e reads from the write w. The constraint reads "if g(s) is true
(i.e., the fence is concretely executed) and if $s_{we}$ is true (i.e., e reads from w),
then $c_{ws}$ is true (i.e., there is a global ordering, due to the fence s, from w to s)".
All other constraints, i.e., the actual ordering of w before some event e' in po
after s, follow by transitivity. We handle B-cumulativity in a similar way, given
in lines [12] and [13].
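As an illustration of Alg. [5] for fences with a single clock (mfence, sync), the Python sketch below emits the non-cumulative and cumulativity constraints of one thread as strings. It is a sketch under our own assumptions, not the tool's code: events are hypothetical (name, kind, address) triples, and the candidate external writes and reads are passed in as dictionaries.

```python
def fence_constraints(thread, ext_writes, ext_reads, store_atomic):
    """Sketch of Alg. 5 for one thread and single-clock fences.

    thread: list of (name, kind, addr) in program order, kind in {"R", "W", "F"}.
    ext_writes / ext_reads: {addr: [names of writes / reads on other threads]}."""
    out = []
    fences = [(j, s) for j, (s, kind, _) in enumerate(thread) if kind == "F"]
    for i, (e, kind, addr) in enumerate(thread):
        if kind == "F":
            continue
        for j, s in fences:
            if i < j:                                    # e before s in po
                out.append(f"g({s}) => c_{e}{s}")
                if not store_atomic and kind == "R":     # A-cumulativity
                    out += [f"(g({s}) & s_{w}{e}) => c_{w}{s}"
                            for w in ext_writes.get(addr, [])]
            else:                                        # e after s in po
                out.append(f"g({s}) => c_{s}{e}")
                if not store_atomic and kind == "W":     # B-cumulativity
                    out += [f"(g({s}) & s_{e}{r}) => c_{s}{r}"
                            for r in ext_reads.get(addr, [])]
    return out

# Thread P0 of iriw with a fence event s0 between its two reads.
p0 = [("a", "R", "x"), ("s0", "F", None), ("b", "R", "y")]
writes_by_addr = {"x": ["i0", "e"], "y": ["i1", "f"]}
non_atomic = fence_constraints(p0, writes_by_addr, {}, store_atomic=False)
atomic = fence_constraints(p0, writes_by_addr, {}, store_atomic=True)
```

On a store-atomic architecture only the two non-cumulative constraints remain; otherwise the read a before the fence additionally contributes one A-cumulativity constraint per candidate external write.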

As Power relaxes store atomicity, the sync fences between the read-read pairs
of iriw create A-cumulativity constraints, namely $(s_{i_0 a} \Rightarrow c_{i_0 s_0}) \wedge (s_{ea} \Rightarrow c_{e s_0})$ for $s_0$
(and analogous ones for $s_1$).

If we were not using fence events, we would create a clock constraint $c_{ee'}$ for
every e' in program order after the fence s to implement Sec. [3], for each fence s.
Thus the non-cumulative part would be cubic already, whereas fence events yield
a quadratic number at most. For cumulativity, we would obtain a constraint for
every pair (r, e') s.t. $(w, r) \in \mathsf{rfe}$ and r is in po before the fence s. The resulting
number of constraints is the number of such pairs (r, e') times the number of
pairs (w, r), i.e., cubic in the number of events per fence s. Furthermore cases of
both A- and B-cumulativity at the same fence s need to be taken into account,
resulting in even higher complexity. Fence events, however, reduce all these cases,
including the combined one, to cubic complexity (all triples of external writes,
reads, and fence events).

Fence lwsync. As lwsync does not order write-read pairs (cf. Sec. [3]), we need to
avoid creating a constraint $c_{wr}$ between a write w and a read r separated
by an lwsync. To do so, we use two distinct clock variables $\mathsf{clock}^r_s$ and $\mathsf{clock}^w_s$
for an lwsync s. This avoids the wrong transitive constraint $c_{wr}$ implied by $c_{ws}$
and $c_{sr}$. Fig. [6] illustrates this setup: the write-read pair $(w_1, r_2)$ will not be
ordered by any of the constraints, but all other pairs are ordered.

Fig. 6. Constraints for lwsync

To create a clock constraint in lines [6], [9], [10] or [13] we then pick one or both
of the clock variables, as follows. If e is a read, the clock constraint is
$\mathsf{clock}_e < \mathsf{clock}^r_s$ when e is before s, i.e., lines [6] or [9] (or $\mathsf{clock}^r_s < \mathsf{clock}_e$ if e is
after, i.e., lines [10] or [13]). If e is a write preceding s (i.e., lines [6] or [9]), the clock
constraint is $\mathsf{clock}_e < \mathsf{clock}^w_s$. Finally, if e is a write after s, i.e., lines [10] or [13],
the clock constraint is the conjunction $(\mathsf{clock}^w_s < \mathsf{clock}_e) \wedge (\mathsf{clock}^r_s < \mathsf{clock}_e)$. To
make lwsync non-cumulative (cf. footnote in Sec. [3]), we just need to disable the
lines [8], [9], [12] and [13].

In iriw, if we use lwsync instead of sync as discussed above, we obtain the
following constraints: $(\mathsf{clock}_a < \mathsf{clock}^r_{s_0}) \wedge (\mathsf{clock}^r_{s_0} < \mathsf{clock}_b) \wedge (s_{i_0 a} \Rightarrow \mathsf{clock}_{i_0} <
\mathsf{clock}^w_{s_0}) \wedge (s_{ea} \Rightarrow \mathsf{clock}_e < \mathsf{clock}^w_{s_0})$. These constraints will not order the writes $i_0$
or e with the read b, because $i_0$ and e are ordered w.r.t. $\mathsf{clock}^w_{s_0}$, but b is only
ordered w.r.t. the distinct $\mathsf{clock}^r_{s_0}$. This corresponds to the fact that placing
lwsync fences in iriw does not forbid the non-SC execution.
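The effect of the two clock variables of an lwsync can be checked by brute force on a thread of the form $e_1$; lwsync; $e_2$. The Python sketch below is ours: it enumerates all valuations of the four clocks over a small range (the bound 4 is an arbitrary choice that suffices for this example) and reports whether the constraints described above force $\mathsf{clock}_{e_1} < \mathsf{clock}_{e_2}$.

```python
from itertools import product

def lwsync_orders(kind1, kind2, bound=4):
    """True iff the constraints for 'e1; lwsync s; e2' force clock_e1 < clock_e2.

    kind1, kind2 are "R" or "W"; s_r and s_w are the two clocks of the lwsync."""
    for e1, e2, s_r, s_w in product(range(bound), repeat=4):
        # e1 before s: a read uses the read clock, a write the write clock.
        before = e1 < (s_r if kind1 == "R" else s_w)
        # e2 after s: a read uses the read clock, a write uses both clocks.
        after = (s_r < e2) if kind2 == "R" else (s_w < e2 and s_r < e2)
        if before and after and not e1 < e2:
            return False          # a satisfying valuation leaves the pair unordered
    return True
```

Only the write-read combination is left unordered, which is the behaviour expected of lwsync.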

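Given an ses $\gamma$, Alg. [5] outputs $C_{\mathsf{ab}'}$. We let $\mathsf{rfe}(\gamma)$ be $\{(w, r) \in \bigcup_a \mathsf{WR}_a \mid
\mathsf{tid}(w) \neq \mathsf{tid}(r) \wedge s_{wr}\}$. We write $\mathsf{ab}'(\gamma)$ for $\{(e_1, e_2) \mid \mathsf{nc}'(e_1, e_2) \vee \mathsf{ac}'(e_1, e_2) \vee
\mathsf{bc}'(e_1, e_2)\}$, where $\mathsf{nc}'(e_1, e_2)$ corresponds to non-cumulativity, i.e., $(e_1, e_2) \in
\mathsf{po}(\gamma) \wedge ((g(e_1) \wedge e_1 \text{ is fence}) \vee (g(e_2) \wedge e_2 \text{ is fence})) \wedge$ not both $e_1$ and $e_2$ are fences,
$\mathsf{ac}'(e_1, e_2)$ to A-cumulativity, i.e., $\exists r.\, (e_1, r) \in \mathsf{rfe}(\gamma) \wedge (r, e_2) \in \mathsf{po}(\gamma) \wedge g(e_2) \wedge
e_2 \text{ is fence}$, and $\mathsf{bc}'(e_1, e_2)$ corresponds to B-cumulativity, i.e., $\exists w.\, (w, e_2) \in \mathsf{rfe}(\gamma) \wedge
(e_1, w) \in \mathsf{po}(\gamma) \wedge g(e_1) \wedge e_1 \text{ is fence}$. We show that Alg. [5] gives the clock constraints
encoding ab'. The proof is immediate like for Lem. [5].

Lemma 8. (C, V, WR) satisfies $\bigwedge_{c \in C_{\mathsf{ab}'}} c$ iff (C, V) satisfies $\bigwedge_{(x,y) \in \mathsf{inst}(\mathsf{ab}'(\gamma), \mathsf{WR})} c_{xy}$.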

We let $ab(\gamma)$ be the symbolic version of ab in Sec. 3, i.e., we let $nc(e_1, s, e_2)$ be $g(s) \wedge s \text{ is fence} \wedge (e_1, s) \in po(\gamma) \wedge (s, e_2) \in po(\gamma)$, $ac(e_1, s, e_2)$ be $\exists r. (e_1, r) \in rfe(\gamma) \wedge nc(r, s, e_2)$ and $bc(e_1, s, e_2)$ be $\exists w.\ nc(e_1, s, w) \wedge (w, e_2) \in rfe(\gamma)$. We only prove this encoding sound w.r.t. Sec. 3, as $ab'$ is more fine-grained than $ab$ (to see why, note that one cannot express $nc'(e_1, e_2)$ as a combination of $nc$, $ac$ or $bc$). Yet we prove our overall encoding complete in Thm. 1.

Lemma 9. If $(C, V, \mathit{WR})$ satisfies $\bigwedge_{c \in C_{ab}} c$ then $(C, V)$ satisfies $\bigwedge_{(e_1,e_2) \in inst(ab(\gamma), \mathit{WR})} c_{e_1 e_2}$.

Proof. We give only the case $lwsync(\gamma)$. Take $(e_1, e_2) \in lwsync(\gamma)$, i.e., there is an lwsync $s$ s.t. $nc(e_1, s, e_2)$ or $ac(e_1, s, e_2)$ or $bc(e_1, s, e_2)$.

In the $nc$ case, we know that $s$ is a fence and $g(s)$ is true, and $(e_1, s) \in po(\gamma)$ and $(s, e_2) \in po(\gamma)$, i.e., $nc'(e_1, s)$ and $nc'(s, e_2)$. Thus $c_{e_1 s}$ and $c_{s e_2}$ are in $C_{ab}$. Now, by definition of $lwsync(\gamma)$, $(e_1, e_2) \notin \mathit{WR}$. For $(e_1, e_2) \in \mathit{WW}$, $c_{e_1 s}$ is $\mathit{clock}_{e_1} < \mathit{clock}^w_s$ and $c_{s e_2}$ is $\mathit{clock}^w_s < \mathit{clock}_{e_2}$. Thus $\mathit{clock}_{e_1} < \mathit{clock}_{e_2}$, i.e., $c_{e_1 e_2}$ holds. Writing $\mathit{RR}$ for the read-read pairs, take $(e_1, e_2) \in \mathit{RR}$. Thus $c_{e_1 s}$ is $\mathit{clock}_{e_1} < \mathit{clock}^r_s$ and $c_{s e_2}$ is $\mathit{clock}^r_s < \mathit{clock}_{e_2}$. Hence $\mathit{clock}_{e_1} < \mathit{clock}_{e_2}$, i.e., $c_{e_1 e_2}$ holds. For $(e_1, e_2) \in \mathit{RW}$, $c_{e_1 s}$ is $\mathit{clock}_{e_1} < \mathit{clock}^r_s$ and $c_{s e_2}$ is $(\mathit{clock}^w_s < \mathit{clock}_{e_2}) \wedge (\mathit{clock}^r_s < \mathit{clock}_{e_2})$. Thus $\mathit{clock}_{e_1} < \mathit{clock}_{e_2}$, i.e., $c_{e_1 e_2}$ holds.

In the $ac$ case, $s$ is a fence and $g(s)$ is true, and there is $r$ s.t. $(e_1, r) \in rfe(\gamma)$ and $nc(r, s, e_2)$. Thus $e_1$ is a write (source of a rf), and since $(e_1, e_2) \notin \mathit{WR}$ (by definition of $lwsync(\gamma)$), $e_2$ is a write. So $ac'(e_1, s)$ and $nc'(s, e_2)$, i.e., $c_{e_1 s}$ and $c_{s e_2}$ hold. We are back to the $\mathit{WW}$ case.

In the $bc$ case, $s$ is a fence and $g(s)$ is true, and there is $w$ s.t. $nc(e_1, s, w)$ and $(w, e_2) \in rfe(\gamma)$. Thus $e_2$ is a read (target of a rf), and since $(e_1, e_2) \notin \mathit{WR}$, $e_1$ is a read. So $nc'(e_1, s)$ and $bc'(s, e_2)$, i.e., $c_{e_1 s}$ and $c_{s e_2}$ hold. We are back to the $\mathit{RR}$ case.
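
The clock arithmetic of the $nc$ case can be checked exhaustively on a small range of integer clocks. The following Python sketch is only an illustration under assumptions: the helper names and the bound `max_clock` are chosen for this example. It confirms that the fence constraints order WW, RR and RW pairs, and that they leave WR pairs unordered, as expected for lwsync.

```python
from itertools import product

def before_fence(kind, e, s_r, s_w):
    """Constraint for an event e that precedes the fence s."""
    # a read is ordered with the read clock, a write with the write clock
    return e < s_r if kind == "R" else e < s_w

def after_fence(kind, e, s_r, s_w):
    """Constraint for an event e that follows the fence s."""
    # a read is ordered with the read clock, a write with both clocks
    return s_r < e if kind == "R" else (s_w < e and s_r < e)

def pair_is_ordered(kind1, kind2, max_clock=3):
    """True if the two fence constraints imply clock_e1 < clock_e2."""
    for e1, e2, s_r, s_w in product(range(max_clock + 1), repeat=4):
        if before_fence(kind1, e1, s_r, s_w) and after_fence(kind2, e2, s_r, s_w):
            if not e1 < e2:
                return False  # found clocks violating the expected order
    return True

for kinds in ("WW", "RR", "RW", "WR"):
    print(kinds, pair_is_ordered(kinds[0], kinds[1]))
```

A bounded search is of course no proof, but the three positive cases follow the transitivity argument given above, and the WR case yields a concrete counterexample.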

5.6 Soundness and completeness of the encoding 

Given an architecture $A$ and a program, the procedure of Sec. 4 and Sec. 5 outputs a formula $ssa \wedge pord$ and an ses $\gamma$. This formula provably encodes the executions of this program valid on $A$ and violating the property encoded in $ssa$ in a sound and complete way. Proving this requires proving that any assignment to the system corresponds to a valid execution of the program, and vice versa. This result requires three steps, one for uniproc, one for thin and one for the acyclicity of ghb. For lack of space, we show only the last one. Given an ses $\gamma$,
we write $\phi$ for $\bigwedge_{c \in C_{ppo} \cup C_{grf} \cup C_{rf} \cup C_{ws} \cup C_{ab}} c$:

Theorem 1. The formula $ssa \wedge \phi$ is satisfiable iff there are $V$, a valuation to the symbols of $ssa$, and a well formed $X$ s.t. $ghb_A(conc(\gamma, V), X)$ is acyclic and has finite prefixes.

Proof. Let $(C, V, \mathit{WR})$ be a satisfying assignment of $ssa \wedge \phi$. By Lem. El [5| and [21 we know that $(C, V, \mathit{WR})$ satisfies $\phi$ iff i) for all $r$ s.t. $g(r)$ is true, there is $w$ s.t. $(w, r) \in inst(rf(\gamma), \mathit{WR})$ and ii) for all $(w, r) \in inst(rf(\gamma), \mathit{WR})$, $g(w)$ is true and $val(w) = val(r)$ and iii) $(C, V, \mathit{WR})$ satisfies $\phi(ppo_A(\gamma)) \wedge \phi(inst(grf(\gamma), \mathit{WR})) \wedge \phi(ws(\gamma)) \wedge \phi(inst(fr(\gamma), \mathit{WR})) \wedge \phi(inst(ab'(\gamma), \mathit{WR}))$.

$\Rightarrow$: Take $X = (conc(inst(rf(\gamma), \mathit{WR}), V), conc(ws(\gamma), V))$. Note that i) and ii), together with Lem. 3, imply that $rf(X)$ is well formed. For $ws(X)$, this comes from the totality of $ws(\gamma)$ over writes to the same address, implied by the shape of $C_{ws}$ (cf. Alg. 0) for the external ws, and by the totality of $po(\gamma)$ for the internal ws.

By Lem. and\^, iii) says that $(C, V, \mathit{WR})$ satisfies $\phi(r)$, with $r = ppo_A(\gamma) \cup inst(grf_A(\gamma), \mathit{WR}) \cup ws(\gamma) \cup inst(fr(\gamma), \mathit{WR}) \cup inst(ab(\gamma), \mathit{WR})$. By Lem. {J\, since $ghb_A(conc(\gamma, V), X)$ is $conc(r, V)$, we have our result.

$\Leftarrow$: We let $E$ be $conc(\gamma, V)$. Take $\mathit{WR}$ s.t. $s_{wr}$ is true iff $(w, r) \in rf(X)$. $rf(X)$ being well-formed implies i) and ii).

We let $ghb'_A(E, X)$ be $r_1 \cup ab'(E, X)$, with $r_1 = ppo_A(E) \cup grf_A(X) \cup ws(X) \cup fr(E, X)$. Note that $ghb_A(E, X)$ is $r_1 \cup ab(E, X)$. We show below that the acyclicity of $ghb_A(E, X)$ implies the acyclicity of $ghb'_A(E, X)$ (idem for finite prefixes). Then we take $r = ghb'_A(E, X)$ in Lem. [H and take $C$ as in Lem. 7. Hence we have $(C, V)$ satisfying $\phi(ppo_A(\gamma)) \wedge \phi(inst(grf(\gamma), \mathit{WR})) \wedge \phi(ws(\gamma)) \wedge \phi(inst(fr(\gamma), \mathit{WR})) \wedge \phi(inst(ab'(\gamma), \mathit{WR}))$, namely iii); our result follows.

We let $r_2 = \{(e_1, e_2) \in ab'; ab' \mid \text{neither } e_1 \text{ nor } e_2 \text{ is a fence}\}$. We write $(x, y) \in r; r'$ for $\exists z. (x, z) \in r \wedge (z, y) \in r'$. One can show: (*) if $(e_1, e_2) \in ghb'^+$, we have $(e_1, e_2)$ in $ab'; (r_1^+ \cup r_2^+)^+$, or $(r_1^+ \cup r_2^+)^+; ab'$, or $(r_1^+ \cup r_2^+)^+$.

Acyclicity: by contradiction, take a cycle in $ghb'_A(E, X)$, i.e., $x$ s.t. $(x, x) \in (ghb'_A(E, X))^+$. In the first two cases of (*), $ab'$ connects two non-fence events, a contradiction. Hence a cycle in $ghb'$ implies one in $r_1 \cup r_2$, i.e., in $ghb$ since $r_2 \subseteq ab$.

Prefixes: as a contradiction, take an infinite path in $ghb'_A(E, X)^+$. Only the cases $(r_1^+ \cup r_2^+)^+$ and $ab'; (r_1^+ \cup r_2^+)^+$ of (*) apply, and both imply an infinite path in $(r_1^+ \cup r_2^+)^+$. Hence, we have an infinite prefix in $ghb^+$, since $r_2 \subseteq ab$.

To decide the satisfiability of $\phi$, we can use any solver supporting a sufficiently rich fragment of first-order logic. The procedure reveals the concrete executions, as expressed by Thm. 1.
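
The link between clock constraints and acyclicity can be illustrated on a finite relation: integer clocks with $\mathit{clock}_x < \mathit{clock}_y$ for every pair $(x, y)$ exist exactly when the relation has no cycle. The Python sketch below is a minimal illustration, not the tool's decision procedure: the function name `clocks_for` and the example relations are assumptions, it uses the standard `graphlib` module (Python 3.9 or later), and it covers only the finite case, not the finite-prefix condition.

```python
from graphlib import TopologicalSorter, CycleError

def clocks_for(relation):
    """Return integer clocks respecting every pair (x, y), or None on a cycle."""
    sorter = TopologicalSorter()
    for x, y in relation:
        sorter.add(y, x)  # x is a predecessor of y, so x gets the smaller clock
    try:
        order = list(sorter.static_order())
    except CycleError:
        return None  # no clock assignment can satisfy a cyclic relation
    return {event: clock for clock, event in enumerate(order)}

acyclic = [("a", "b"), ("b", "c")]
cyclic = [("a", "b"), ("b", "a")]
print(clocks_for(acyclic))
print(clocks_for(cyclic))
```

A solver performs the same search symbolically: a satisfying assignment to the clock variables is a witness that the instantiated relation is acyclic.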

5.7 Comparison to [14] and [61,62]

Both [14] and [61,62] use an SSA encoding similar to our ssa of Sec. S) The
difference resides in the ordering constraints.

[14] encodes total orders over memory accesses. Thus, in contrast to our clock variables with less-than constraints, [14] uses a Boolean variable $M_{xy}$ per pair $(x, y)$, whose value places $x$ and $y$ in a total order: either $x$ before $y$, or $y$ before $x$. Prog. 1 has $3 \cdot N$ memory accesses per thread, hence [13]'s encoding has $6 \cdot N \cdot (6 \cdot N - 1)$ Boolean variables. [13] builds additional constraints for the transitive closure; their number is at least cubic in the number of variables $M_{xy}$, leading to $O(N^6)$ constraints.
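
As a quick check of this count, the following Python sketch evaluates the number of Boolean variables of a total-order encoding for a few values of N. It is only a worked example: the function names are hypothetical, and it assumes two threads, which is what the total of $6 \cdot N$ accesses implies.

```python
def shared_accesses(n, threads=2, per_iteration=3):
    """Total memory accesses: 3*N per thread, as stated for Prog. 1."""
    return threads * per_iteration * n

def total_order_variables(n):
    """One Boolean variable M_xy per ordered pair of distinct accesses."""
    m = shared_accesses(n)
    return m * (m - 1)

for n in (1, 5, 10):
    print(n, shared_accesses(n), total_order_variables(n))
```

For N = 5, the value used for the Fibonacci example in the experiments, this gives 30 accesses and 870 variables before any transitive closure constraint is added.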

We only consider relations per address, except for program order and fence orderings, and do not build transitive closures. The constraints for fr and ab are cubic in the worst case; all others are quadratic. In Prog. 1 the write serialisation is internal, hence fr is only quadratic. Hence our number of constraints is $O(N^2)$.

[61,62] use partial orders like us; they note redundancies in their constraints in [62] but do not explain them, which we do below. Basically, [61,62] quantify over all events regardless of their address, whereas we mostly build constraints per address. Fig. 7 shows that the maximal number of events to a single address is experimentally much smaller than the total number of events.

Our notations correspond to the ones of [62] as follows (the original description [61] has different notations). $HB(a, b)$ is our clock constraint $c_{ab}$. The functions addr and val map to ours; $en(x)$ is our $g(x)$; $link(r, w)$ denotes that $r$ reads from $w$, i.e., our $s_{wr}$. [62] expresses po, rf, fr, and ws as follows (since it is restricted to SC, it gives no encoding of $ppo_A$, $grf_A$, and $ab_A$).

[62] encodes po as the conjunction of the $c_{o_i o_j}$, with $o_i$ in po before $o_j$. If the implementation of [62] strictly follows this definition, it redundantly includes the transitive closure constraints, which we avoid by building the transitive reduction in Alg. H

[62] encodes rf in $\pi_1 := \forall r. \exists w.\ g(r) \Rightarrow (g(w) \wedge s_{wr})$ and $\pi_2 := \forall r. \forall w.\ s_{wr} \Rightarrow (c_{wr} \wedge addr(r) = addr(w) \wedge val(r) = val(w))$. [62] forces rf to be exclusive. We explained in Sec. 5.1 why this is unnecessary in our case, which allows us to only build a disjunction over writes (cf. Alg. 1) linear in their number.

$\pi_2$ combines our value and clock constraints, with one major difference: $\pi_2$ ranges over all reads and writes, regardless of their address. Our rf (Alg. 1) ranges over pairs to the same address, thus reaches this number only when all reads and writes have the same memory address, which is unlikely in non-trivial programs.

Ranging over the same address, as we do, and not all addresses, as in [62], becomes even more advantageous in $\pi_3$, encoding fr: $\pi_3 := \forall r. \forall w. \forall w'.\ (s_{wr} \Rightarrow (g(w') \wedge \neg c_{w'w} \wedge \neg c_{rw'} \Rightarrow addr(r) \neq addr(w')))$. $\pi_3$ ranges over all $(r, w, w')$, again independently of their addresses. For distinct addresses the conjunction holds trivially, but [62] builds it nevertheless. Our fr encoding quantifies only over the same address, thus spares these trivial constraints.
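
The saving can be made concrete by counting the candidate triples $(r, w, w')$ in both styles. The Python sketch below is an illustration under assumptions: events are modelled as hypothetical `(name, kind, address)` tuples, the function names are invented for this example, and triples with $w = w'$ are counted in both cases for simplicity.

```python
from collections import Counter

def triples_all_addresses(events):
    """Triples (r, w, w') when quantifying over every read and every write."""
    reads = sum(1 for _, kind, _ in events if kind == "R")
    writes = sum(1 for _, kind, _ in events if kind == "W")
    return reads * writes * writes

def triples_same_address(events):
    """Triples (r, w, w') restricted to accesses of one common address."""
    reads = Counter(addr for _, kind, addr in events if kind == "R")
    writes = Counter(addr for _, kind, addr in events if kind == "W")
    return sum(reads[a] * writes[a] ** 2 for a in reads)

# two addresses, each with one read and two writes
events = [
    ("r1", "R", "x"), ("w1", "W", "x"), ("w2", "W", "x"),
    ("r2", "R", "y"), ("w3", "W", "y"), ("w4", "W", "y"),
]
print(triples_all_addresses(events), triples_same_address(events))
```

On this small example the address-oblivious count is 32 triples against 8 per address; the gap widens as the number of distinct addresses grows.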

[62] does not encode ws. The totality of ws comes as a side effect: [62] initialises each write with a unique integer, hence writes are totally ordered by $<$ over integers. This is again regardless of the addresses, whereas we order writes to the same address only.

6 Experimental Results



We detail here our experiments, which indicate that our technique is scalable 
enough to verify non-trivial, real-world concurrent systems code, including the 
worker-synchronisation logic of the relational database PostgreSQL, code for 
socket-handover in the Apache httpd, and the core API of the Read-Copy-Update
mutual exclusion code from Linux 3.2.21.

We implement our technique within the bounded model checker CBMC [TH] , 
using a SAT solver as an underlying decision procedure. We see two primary 
comparison points to estimate the overhead introduced by the partial order 
constraints. First, we pass the benchmarks with a single, fixed interleaving 
to sequential CBMC. Our implementation performs comparably to sequential 
CBMC, as Fig. 7 shows (rows "sequential" and "concurrent"). Second, we compare
to ESBMC [19], which also implements bounded model checking, but uses
interleaving-based techniques.

In Fig. 7 we gather facts about all examples: the Fibonacci example from
pTj with N=5, 4500 litmus tests (see below), the worker synchronisation in Post- 
greSQL, RCU, and fdqueue in Apache httpd. For each we give the number of lines 
of code (LOC), the number of distinct memory addresses "tot. addr" (including 
unused shared variables), the total number of shared accesses "tot. shared", the 
maximal number of accesses to a single address "same addr", the total number
of constraints "all constr" and the relation with the most costly encoding, in
terms of the number of constraints generated. We give the loop unrolling bounds 
"unroll": we write "none" when there is no loop, and "bounded" when the loops 
in the program are natively bounded. 

The total number of shared accesses is on average 13 times the maximal 
number of accesses to a single address. The most costly constraint is usually the 
read-from, or the barriers, which build on read-from. The time needed by our 
tool to analyse a program grows with the total number of constraints generated. 
ESBMC is 4 times slower than our tool on Fibonacci, 3050 times slower on the 
litmus tests, times out on PostgreSQL, and cannot parse RCU and Apache.

Fig. 7. Facts about all examples

Fig. 8. Comparison of all tools on all examples (time out 30 mins)



Other tools There are very few tools for verifying concurrent C programs, even
on SC [21]. For weak memory, existing techniques are restricted to TSO, and its
siblings PSO and RMO [14,44,43,9,31,39]. Not all of them have been implemented,
and only few handle systems code given as C programs.

Thus, as a further comparison point, we implemented an instrumentation 
technique [7], similar to [9]. The technique of [9] is restricted to TSO, and consists
in delaying writes, so that the SC executions of the instrumented code simulate 
the TSO executions of the original program. Our instrumentation handles all 
the models of Sec. O 

We tried 5 ANSI-C model checkers: SatAbs, a verifier based on predicate
abstraction [17]; ESBMC; CImpact, a variant of the Impact algorithm [55]
extended to SC concurrency; Threader, a thread-modular verifier [34]; and Poirot,
which implements a context-bounded translation to sequential programs [45].
These tools cover a broad range of techniques for verifying SC programs. We
also tried CheckFence [14].

In Fig. 8 we compare all tools on all examples: F for Prog. 1, L for the
litmus tests, P for PostgreSQL with its bug, Pf for our fix, R for RCU and A for
Apache. For L, P, R and A, the bounds are as in Fig. 7; for Pf we take the one of
P. For F we try the maximal N that the tool can handle within the time out of
30 mins. For each tool, we specify the model below. We write "t/o" when there
is a timeout. We write "fail" when the tool gives a wrong answer. CheckFence
provides a conversion module from C to its internal representation; we write
"conv err" when it fails. We write "parse err" when the tool cannot parse the
example. SatAbs uses a refinement procedure; we write "ref err" when it fails.
When a tool verifies an example we write "V"; when it finds a counterexample
we write "CE".

Fibonacci All tools, except for ESBMC, SatAbs and ours, fail to analyse Fi- 
bonacci. Poirot claims the assertion is violated for any N, which is not the case 
for 1 < N < 5. SatAbs does not reach beyond N = 4. Our tool handles more 
than N = 300, which is 30 times more loop unrolling than ESBMC, within the 
same amount of time. 

Litmus tests We analyse 4500 tests exposing weak memory artefacts, e.g.,
instruction reordering, store buffering, store atomicity relaxation. These tests are
generated by the diy tool [8], which generates assembly programs with a final
state unreachable on SC, but reachable on a weaker model. For example, the final
state of iriw (Fig. 5) can only be reached on RMO (by reordering the reads) or on
Power (idem, or because the writes are non-atomic).

We convert these tests into C code, of 50 lines on average, involving 2 to 4 
threads. Despite the small size of the tests, they prove challenging to verify, as 
Fig. [8] shows: most tools, except Blender, SatAbs and ours, give wrong results 
or fail in other ways on a vast majority of tests, even for SC. For each tool we 
give the average percentage of correct results over all models. Our tool verifies 
all tests on all models in 0.22 s on average. 

PostgreSQL Developers observed that a regression test failed on a PowerPC
machine⁴, and later identified the memory model as a possible culprit: the processor
could delay a write by a thread until after a token signalling the end of this 
thread's work had been set. Our tool confirmed the bug, and proved a patch we 
proposed. A detailed description of the problem is in [TJ. 

RCU Read-Copy-Update (RCU) is a synchronisation mechanism of the Linux 
kernel, introduced in version 2.5. Writers to a concurrent data structure prepare 
a fresh component (e.g., list element), then replace the existing component by 
adjusting the pointer variable linking to it. Clean-up of the old component is 
delayed until there is no process reading. 

Thus readers can rely on very lightweight (and thus fast) lock-free synchro- 
nisation only. The protection of reads against concurrent writes is fence-free on 
x86, and uses only a light-weight fence (lwsync) on Power. We verify the original 
implementation of the 3.2.21 kernel for x86 (5824 lines) and Power (5834 lines) 
in less than 1 s, using a harness that asserts that the reader will not obtain an 
inconsistent version of the component. On Power, removing the lwsync makes 
the assertion fail. 

Apache The Apache httpd is the most widely used HTTP server software. It 
supports a broad range of concurrency APIs distributing incoming requests to a 
pool of workers. 

The fdqueue module (28864 lines) is the central part of this mechanism, 
which implements the hand-over of a socket together with a memory pool to an 
idle worker. The implementation uses a central, shared queue for this purpose. 
Shared access is primarily synchronised by means of an integer keeping track 
of the number of idle workers, which is updated via architecture-dependent
compare-and-swap and atomic decrement operations. Hand-over of the socket
and the pool and wake-up of the idle thread is then coordinated by means of a 
conventional, heavy-weight mutex and a signal. We verify that hand-over guar- 
antees consistency of the payload data passed to the worker in 2.45 s on x86 and 
2.8 s on Power. 



4 http://archives.postgresql.org/pgsql-hackers/2011-08/msg00330.php

7 Conclusion



Our experiments demonstrate that weakness is a virtue for programs with bounded 
loops. Our proofs suggest that this contention is not limited to bounded loops, 
but impracticable as is, since it involves infinite structures. Thus we believe that 
this work opens up new possibilities for over-approximation for programs with 
unbounded loops, which we hope to investigate in the future. 

Acknowledgements We would like to thank Lihao Liang and Alex Horn for